Abstract

When providing probabilistic forecasts for uncertain future events, it is common to strive for calibrated forecasts, that is, the predictive distribution should be compatible with the observed outcomes. Several notions of calibration are available in the case of a single forecaster, alongside diagnostic tools and statistical tests to assess calibration in practice. Often, there is more than one forecaster providing predictions, and these forecasters may use information of the others and therefore influence one another. We extend common notions of calibration, where each forecaster is analysed individually, to notions of cross-calibration where each forecaster is analysed with respect to the other forecasters in a natural way. It is shown theoretically and in simulation studies that cross-calibration is a stronger requirement on a forecaster than calibration. Analogously to calibration for individual forecasters, we provide diagnostic tools and statistical tests to assess forecasters in terms of cross-calibration. The methods are illustrated in simulation examples and applied to probabilistic forecasts for inflation rates by the Bank of England.


1 Introduction 


In the past decades, probabilistic forecasts, specifying a complete predictive probability distribution for an uncertain future event, have replaced point forecasts in a number of applications including weather forecasting, climate predictions and economics; see Gneiting and Katzfuss (2014) for a recent overview. Murphy and Winkler (1987); Gneiting et al. (2007) have formulated the guiding principle for a probabilistic forecast to "Maximize sharpness subject to calibration". Calibration refers to the statistical compatibility of the forecasts and the observations. Sharpness, on the other hand, is a property that concerns the forecast only. Roughly speaking, a forecast is sharper the more concentrated the distribution is, with point forecasts as a limiting case. Gneiting et al. (2007) have formulated their principle in order to pick the "better" of two calibrated forecasts. While it is generally acknowledged that forecasts should be calibrated (Dawid, 1984; Diebold et al., 1998), it is not universally accepted that it is necessary to consider sharpness as a further criterion for forecast evaluation (Mitchell and Wallis, 2011).


In this manuscript we are concerned with calibration only. However, we consider a framework where several forecasters issue competing forecasts. We propose concepts of cross-calibration in order to formalize the influence of forecasters amongst each other and with respect to the observations. Essentially, a cross-calibrated forecaster not only uses her own information optimally but also incorporates the information of the competing forecasters in an optimal way. The notions we propose refine the existing notions of calibration of Gneiting and Ranjan (2013). Furthermore, we extend their prediction space setting to allow for serial dependence, which is the usual situation in forecasting applications. We are able to extend the result of Diebold et al. (1998) of uniformity and independence of probability integral transform (PIT) values to our general framework.

Notions of cross-calibration have previously been considered in the literature for binary or categorical outcomes. Al-Najjar and Weinstein (2008) consider a test which an uninformed forecaster cannot pass with high probability when an informed forecaster is present. The notion of cross-calibration by Feinberg and Stewart (2008) takes into account that several forecasters may influence each other, and the one with the largest information set should be preferred. In this paper we generalize the cross-calibration notions of Feinberg and Stewart (2008) to forecasts of real valued outcomes including diagnostic tools and statistical tests to assess cross-calibration in applications.


The cross-calibration test suggested by Feinberg and Stewart (2008) uses the following framework, which we review here only in the case of two forecasters for simplicity. Let $\Omega = \{(\omega_t)_{t=0,1,\dots} \mid \omega_t \in \{0,1\}\}$ denote the space of all possible realizations and let $n > 4$ be an integer. Divide the interval $[0,1]$ into $n$ equal closed subintervals $[0,1/n], \dots, [(n-1)/n, 1]$. At time $t$, forecaster $j$, $j = 1, 2$, makes a prediction which is given as an interval $I_t^j \in \{[0,1/n], \dots, [(n-1)/n, 1]\}$. It gives bounds on the predictive probability that the next realization $\omega_t$ is equal to one. The cross-calibration test is defined over the sequence of forecast-observation triples $(I_t^1, I_t^2, \omega_t)_{t=0}^{\infty}$. For any pair $\ell = (\ell_1, \ell_2) \in \{1, \dots, n\}^2$ and any time $T$, let

$$\nu_T^{\ell} = \sum_{t=0}^{T} \mathbf{1}\left\{ I_t^1 = \left[\frac{\ell_1 - 1}{n}, \frac{\ell_1}{n}\right],\; I_t^2 = \left[\frac{\ell_2 - 1}{n}, \frac{\ell_2}{n}\right] \right\},$$

which is the number of times up to $T$ that the forecasting profile $\ell$ is chosen. For $\nu_T^{\ell} > 0$, the frequency of realizations equal to one conditional on the forecasting profile is given by

$$f_T^{\ell} = \frac{1}{\nu_T^{\ell}} \sum_{t=0}^{T} \omega_t\, \mathbf{1}\left\{ I_t^1 = \left[\frac{\ell_1 - 1}{n}, \frac{\ell_1}{n}\right],\; I_t^2 = \left[\frac{\ell_2 - 1}{n}, \frac{\ell_2}{n}\right] \right\}.$$

A forecaster $j$ passes the cross-calibration test at the outcome $(I_t^1, I_t^2, \omega_t)_{t=0}^{\infty}$ if

$$\limsup_{T \to \infty} \left| f_T^{\ell} - \frac{2\ell_j - 1}{2n} \right| \le \frac{1}{2n}$$

for every $\ell$ satisfying $\lim_{T \to \infty} \nu_T^{\ell} = \infty$.


It is shown in Feinberg and Stewart (2008) that a forecaster who is aware of the distribution of $\omega$ passes the cross-calibration test with probability one, no matter which strategy the other forecaster uses. From a theoretical point of view, this is an interesting result. However, testing empirically if a forecast is cross-calibrated is rather difficult. The problem is that already if $n = 5$, there are 25 forecasting profiles to consider. For each of these profiles the empirical frequency conditioned on that profile should lie inside the predicted interval of the cross-calibrated forecaster. But some profiles are hardly ever predicted and therefore the number of observations needs to be very large. We illustrate this problem with the following simple simulation example, which has been implemented in R (R Development Core Team, 2008) like all further simulations in this paper.

$T$                  Monte-Carlo power
$10^4$               0.112
$5 \cdot 10^4$       0.254
$10^5$               0.333
$5 \cdot 10^5$       0.699
$10^6$               0.847
$5 \cdot 10^6$       0.994

Table 1: Monte-Carlo power of detecting cross-calibration for different time periods $T$.


Example 1.1. In this example we consider the setting of the cross-calibration test described above. Let $B_t$, $C_t$, $t = 0, \dots, T$ be independent beta random variables with parameters $(3, 5)$ for $B_t$ and $(2, 1.5)$ for $C_t$. We simulate a (finite) stochastic process $(\omega_t)_{t=0}^{T}$ where $\omega_t$ is conditionally Bernoulli distributed with probability $(B_t + C_t)/2$. Let $n = 5$. The first forecaster predicts at each time $t$ the interval $I_t^1$ which contains the value $(B_t + C_t)/2$, which is the probability that the realization $\omega_t$ is one. The second forecaster predicts the interval $I_t^2$ which contains the value $C_t$. Therefore, we expect that the first forecaster is cross-calibrated with respect to the second forecaster and should pass the test. The first forecaster passes the test if, for all forecasting profiles $\ell = (\ell_1, \ell_2)$ where $\nu_T^{\ell}$ is positive, $f_T^{\ell}$ lies in $[(\ell_1 - 1)/n, \ell_1/n]$. The result is shown in Table 1. For several $T$ we performed the test 1000 times. The second column of the table gives the Monte-Carlo power of the test, that is, how often the test detected the cross-calibration of the first forecaster, divided by the number of repetitions (1000).
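The simulation of Example 1.1 can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the R code used for the paper; the function names `passes_test` and `monte_carlo_power` are hypothetical, and the finite-sample pass criterion is the one stated in the example (the conditional frequency lies in the interval predicted by the first forecaster for every profile that occurs).

```python
import random


def passes_test(T, n=5, rng=random):
    """One run of the finite-sample check of Example 1.1 for forecaster 1."""
    counts = [[0] * n for _ in range(n)]  # how often each profile (l1, l2) occurs
    ones = [[0] * n for _ in range(n)]    # realizations equal to one per profile
    for _ in range(T + 1):                # t = 0, ..., T
        b = rng.betavariate(3.0, 5.0)
        c = rng.betavariate(2.0, 1.5)
        p = (b + c) / 2.0                 # probability that omega_t equals one
        omega = 1 if rng.random() < p else 0
        l1 = min(int(p * n), n - 1)       # zero-based index of the interval containing p
        l2 = min(int(c * n), n - 1)       # zero-based index of the interval containing c
        counts[l1][l2] += 1
        ones[l1][l2] += omega
    for l1 in range(n):
        for l2 in range(n):
            if counts[l1][l2] > 0:
                freq = ones[l1][l2] / counts[l1][l2]
                # forecaster 1 predicted the interval [l1/n, (l1+1)/n]
                if not (l1 / n <= freq <= (l1 + 1) / n):
                    return False
    return True


def monte_carlo_power(T, reps=1000, n=5, seed=0):
    """Fraction of repetitions in which forecaster 1 passes the check."""
    rng = random.Random(seed)
    return sum(passes_test(T, n, rng) for _ in range(reps)) / reps
```

Calling `monte_carlo_power(10**4)` mirrors the setup of the first row of Table 1. The estimate depends on the random seed, and the pure-Python loop becomes slow for the larger values of $T$ in the table.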


Table 1 shows that already for this rather simple example, the sample size $T$ needs to be large in order to come close to the theoretically predicted power of one. Furthermore, the test is only applicable to probabilistic forecasts for binary outcomes. The goal of this paper is to extend the notion of cross-calibration to probabilistic forecasts of real valued quantities, and present methodology to empirically assess cross-calibration for serially dependent forecast-observation tuples. We have chosen to work in the framework of prediction spaces as introduced by Gneiting and Ranjan (2013), and extend it to allow for serial dependence.

The paper is organized as follows. In Section 2 we review and extend the notion of a prediction space and generalize the notions of calibration for individual forecasters to multiple forecasters. We introduce diagnostic tools for checking cross-calibration and illustrate their usefulness in a simulation study in Section 3. In Section 4 we treat the special case of binary outcomes and relate our work to the existing results of Feinberg and Stewart (2008). Statistical tests for cross-calibration are derived in Section 5. We analyse the Bank of England density forecasts for inflation rates in Section 6. Finally, the paper concludes with a discussion in Section 7.
2 Notions of cross-calibration


Gneiting and Ranjan (2013) introduced the notion of a prediction space as follows. 


Definition 2.1 (one-period prediction space). Let $k \ge 1$ be an integer. A prediction space is a probability space $(\Omega, \mathcal{A}, \mathbb{Q})$ together with sub-$\sigma$-algebras $\mathcal{A}_1, \dots, \mathcal{A}_k \subset \mathcal{A}$. The elements of $\Omega$ are tuples of the form $(F_1, \dots, F_k, Y, V)$ such that, for $i = 1, \dots, k$, $F_i$ is a CDF-valued random quantity that is measurable with respect to $\mathcal{A}_i$,$^1$ $Y$ is a real-valued random variable, and $V$ is a uniformly distributed random variable on $[0,1]$, independent of $\mathcal{A}_1, \dots, \mathcal{A}_k, Y$.


The integer $k$ corresponds to the number of forecasters. The $\sigma$-algebra $\mathcal{A}_i$ can be seen as the information set available to forecaster $i$. The random variable $Y$ is the observation; the random variable $V$ is needed for technical reasons. It allows us to define the probability integral transform (PIT) in Definition 2.6 below.


We term the prediction space proposed by Gneiting and Ranjan (2013) a one-period prediction space as it is only concerned with predictions for an outcome $Y$ at one time point. While this framework is sufficient to define various notions of calibration and cross-calibration of forecasters in principle, a statistical analysis of calibration is only possible if we can assume that we have a sequence $(F_{1,n}, \dots, F_{k,n}, Y_n, V_n)_{1 \le n \le N}$ of independent forecast-observation tuples. This assumption is unrealistic in most forecasting situations. Therefore, we propose to extend the prediction space setting, allowing for serial dependence as follows.


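Definition 2.2 (prediction space for serial dependence). Let $k \ge 1$ be an integer. A prediction space for serial dependence is a probability space $(\Omega, \mathcal{A}, \mathbb{Q})$ together with filtrations $(\mathcal{A}_{1,t})_{t \in \mathbb{N}}, \dots, (\mathcal{A}_{k,t})_{t \in \mathbb{N}} \subset \mathcal{A}$. The elements of $\Omega$ are sequences of tuples of the form $(F_{1,t}, \dots, F_{k,t}, Y_{t+1}, V_t)_{t \in \mathbb{N}}$, where $(Y_t)_{t \in \mathbb{N}}$ is a sequence of real-valued random variables, and $(V_t)_{t \in \mathbb{N}}$ is an iid sequence of standard uniform random variables that is independent of everything else. Let $\mathcal{T}_t = \sigma(Y_s \mid s \le t)$ be the $\sigma$-algebra generated by the observations until time $t$. For all $t \in \mathbb{N}$ and $i = 1, \dots, k$, $F_{i,t}$ is a CDF-valued random quantity that is $\sigma(\mathcal{A}_{i,t}, \mathcal{T}_t)$-measurable. We assume that, for all $t \in \mathbb{N}$, $m \ge 1$,

$$\mathcal{L}(Y_{t+1} \mid \mathcal{A}_{1,t+m}, \dots, \mathcal{A}_{k,t+m}, \mathcal{T}_t) = \mathcal{L}(Y_{t+1} \mid \mathcal{A}_{1,t}, \dots, \mathcal{A}_{k,t}, \mathcal{T}_t), \tag{1}$$

where $\mathcal{L}(X \mid \mathcal{G})$ denotes the conditional law of a random variable $X$ with respect to the $\sigma$-algebra $\mathcal{G}$.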


The notation in Definition 2.2 is chosen such that $\mathcal{A}_{i,t}$ encodes the information of the $i$-th forecaster $F_{i,t}$ at time $t$ to predict the outcome $Y_{t+1}$ at the next time point. Additionally, all forecasters $F_{1,t}, \dots, F_{k,t}$ have access to the past realizations of $Y_t$ in principle, that is, to the information contained in $\mathcal{T}_t$. This means that we have separated the information of forecaster $F_i$ into two parts: the information of past realizations of the outcome, $\mathcal{T}_t$, that is available to all forecasters, and a personal information set $\mathcal{A}_{i,t}$ that she acquires (partially) from other sources. Condition (1) formalizes that information from other sources about the outcome at time point $t + 1 + m$ should not influence the outcome $Y_{t+1}$ at time point $t + 1$. A sufficient condition for (1) to hold is that $\mathcal{A}_{i,t+m} = \sigma(\mathcal{A}_{i,t}, \mathcal{B}_{i,t+m})$ with $\mathcal{B}_{i,t+m}$ independent of $\mathcal{A}_{i,t}$ and $\mathcal{T}_t$.

$^1$ That is, for all finite collections $x_1, \dots, x_n$ of real numbers and $B_1, \dots, B_n$ of Borel sets, the event $\{F_i(x_j) \in B_j \text{ for } j = 1, \dots, n\} \in \mathcal{A}_i$.

Let us illustrate this point in the context of weather forecasting. Suppose a numerical weather prediction system is used to calculate the state of the atmosphere to help us predict temperature tomorrow. Condition (1) means that if we let the numerical system run longer to give us also information about the atmosphere the day after tomorrow, this will have no influence on what temperature is realized tomorrow.

All further statements are within the prediction space setting and expressions such as almost surely are with respect to the probability measure $\mathbb{Q}$. In the prediction space for serial dependence, $F_{i,t}$ is termed ideal with respect to $\mathcal{A}_{i,t}$ if

$$F_{i,t} = \mathcal{L}(Y_{t+1} \mid \mathcal{A}_{i,t}, \mathcal{T}_t) \quad \text{almost surely.}$$


In the case of independent forecast-observation tuples, we recover the definition of an ideal forecaster of Gneiting and Ranjan (2013), that is, in the one-period prediction space setting, $F_i$ is ideal with respect to $\mathcal{A}_i$ if

$$F_i = \mathcal{L}(Y \mid \mathcal{A}_i) \quad \text{almost surely;}$$

see also Tsyplakov (2011, 2013). We generalize this notion as follows.


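Definition 2.3 (cross-ideal). In the prediction space setting for serial dependence, we call $F_{i,t}$ cross-ideal with respect to $\mathcal{A}_{1,t}, \dots, \mathcal{A}_{k,t}$ if

$$F_{i,t} = \mathcal{L}(Y_{t+1} \mid \mathcal{A}_{1,t}, \dots, \mathcal{A}_{k,t}, \mathcal{T}_t) \quad \text{almost surely.} \tag{2}$$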


A cross-ideal forecaster does not only use her own information optimally but also the information available to the other forecasters. In fact, at time $t$, her information $\mathcal{A}_{i,t}$ contains all relevant information of all the forecasters because $F_{i,t}$ is $\sigma(\mathcal{A}_{i,t}, \mathcal{T}_t)$-measurable and hence by (2) also $\mathcal{L}(Y_{t+1} \mid \mathcal{A}_{1,t}, \dots, \mathcal{A}_{k,t}, \mathcal{T}_t)$ is $\sigma(\mathcal{A}_{i,t}, \mathcal{T}_t)$-measurable, implying that $\mathcal{L}(Y_{t+1} \mid \mathcal{A}_{1,t}, \dots, \mathcal{A}_{k,t}, \mathcal{T}_t) = \mathcal{L}(Y_{t+1} \mid \mathcal{A}_{i,t}, \mathcal{T}_t)$. Therefore, each cross-ideal forecaster is ideal, whereas the converse does not hold in general; see Examples 2.5 and 3.1. The above argument shows more generally the following proposition.

Proposition 2.4. For some $t \in \mathbb{N}$, let $F_{1,t}, \dots, F_{k,t}$ be forecasters with information sets $\mathcal{A}_{1,t}, \dots, \mathcal{A}_{k,t}$ in a prediction space for serial dependence. If $F_{1,t}$ is cross-ideal with respect to $\mathcal{A}_{1,t}, \dots, \mathcal{A}_{k,t}$, then it is also cross-ideal with respect to $\mathcal{A}_{1,t}, \mathcal{A}_{i_2,t}, \dots, \mathcal{A}_{i_m,t}$, where $\{i_2, \dots, i_m\} \subset \{2, \dots, k\}$.

For clarity, we have chosen to illustrate the notions of cross-ideal forecasters (or cross-calibrated forecasters; see Definition 2.7) with independent forecast-observation tuples, or, in other words, in the one-period prediction space setting of Gneiting and Ranjan (2013), dropping the time index $t$. This is natural, as the notions of calibration are essentially one-period concepts, and make no use of assumption (1). The purpose of assumption (1) will become clear in Theorem 2.11 below where we generalize the result of Diebold et al. (1998) on uniformity and independence of PIT values.
Example 2.5. Let $\nu$ be uniformly distributed on $(5, 20)$ and, conditionally on $\nu$, let $\sigma^2$ have an inverse chi-squared distribution with $\nu$ degrees of freedom. Conditional on $\nu$ and $\sigma$, the outcome $Y$ is normally distributed with mean zero and variance $\sigma^2$, and we consider two forecasters, a normally distributed forecaster $F_1 = \mathcal{N}(0, \sigma^2)$ and a $t$-distributed forecaster $F_2 = t_\nu$. This example is constructed such that $F_1$ has the full information about the distribution of the outcome $Y$, whereas $F_2$ only knows the prior distribution of $\sigma^2$. We have that $F_1$ and $F_2$ are both ideal with respect to their information sets $\mathcal{A}_1 = \sigma(\sigma^2)$ and $\mathcal{A}_2 = \sigma(\nu)$, respectively, but only $F_1$ is cross-ideal with respect to $\mathcal{A}_1, \mathcal{A}_2$.

More specifically, the predictive density function $f_1(\cdot \mid \sigma^2)$ of $F_1$ is a normal density with variance $\sigma^2$, and the predictive density function $f_2$ of $F_2$ is

$$f_2(x \mid \nu) = \int_0^\infty f_1(x \mid s)\, g(s \mid \nu)\, ds = \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma(\nu/2)} \left(1 + \frac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}}, \tag{3}$$

where $g(s \mid \nu) = (\nu/2)^{\nu/2} s^{-\nu/2 - 1} \exp\{-\nu/(2s)\}/\Gamma(\nu/2)$ is the density function of an inverse chi-squared distribution with $\nu$ degrees of freedom. The right hand side of (3) is the density of a $t$-distribution. Equation (3) holds because, for a normal likelihood with known mean, the inverse chi-squared distribution is a conjugate prior for the variance, and the resulting predictive distribution is a $t$-distribution. Therefore, we see that $F_1$ is cross-ideal with respect to $\mathcal{A}_1, \mathcal{A}_2$. It is clear that $F_2$ is not cross-ideal with respect to $\mathcal{A}_1, \mathcal{A}_2$. We will come back to this example throughout the paper.
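The mixture identity in equation (3) of Example 2.5 can be checked numerically. The following Python sketch (standard library only; all function names are hypothetical) integrates the normal density against the inverse chi-squared density with the trapezoidal rule after the substitution $s = e^z$, and compares the result with the $t$-density. The integration range and step count are illustrative choices, not taken from the paper.

```python
import math


def normal_density(x, variance):
    """Density of N(0, variance) at x."""
    return math.exp(-x * x / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


def inv_chi2_density(s, nu):
    """Density of the inverse chi-squared distribution of Example 2.5, on the log scale."""
    log_g = ((nu / 2) * math.log(nu / 2) - (nu / 2 + 1) * math.log(s)
             - nu / (2 * s) - math.lgamma(nu / 2))
    return math.exp(log_g)


def t_density(x, nu):
    """Density of the t-distribution with nu degrees of freedom."""
    log_c = math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2) - 0.5 * math.log(nu * math.pi)
    return math.exp(log_c - (nu + 1) / 2 * math.log1p(x * x / nu))


def mixture_density(x, nu, lo=-12.0, hi=12.0, steps=6000):
    """Integral of f_1(x | s) g(s | nu) ds via the trapezoidal rule with s = exp(z)."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        z = lo + k * h
        s = math.exp(z)
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * normal_density(x, s) * inv_chi2_density(s, nu) * s  # ds = s dz
    return total * h
```

For any $\nu$ in $(5, 20)$ and moderate $x$, `mixture_density(x, nu)` and `t_density(x, nu)` should agree up to the quadrature error.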


The most prominent diagnostic tool for checking calibration empirically is the probability integral transform (PIT) (Dawid, 1984; Diebold et al., 1998).


Definition 2.6 (PIT). Let $F$ be a (possibly random) CDF, $X$ be a random variable and $V$ a standard uniform random variable independent of $F$ and $X$. We define

$$Z_F^X = F(X-) + V\{F(X) - F(X-)\},$$

where $F(y-) = \lim_{x \uparrow y} F(x)$. In the prediction space setting, the random variable $Z_{i,t} := Z_{F_{i,t}}^{Y_{t+1}}$ is called the probability integral transform (PIT) of the $i$-th forecaster $F_{i,t}$.


The PIT $Z_{i,t}$ is a random variable with values in $[0,1]$. If $F$ is deterministic and $X \sim F$, then $Z_F^X$ is uniformly distributed and $F^{-1}(Z_F^X) = X$ almost surely, where $F^{-1}$ is the quantile function of $F$; see for example Rüschendorf (2009). Based on the PIT we introduce the following notions of cross-calibration.
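To make the role of the auxiliary variable $V$ in Definition 2.6 concrete, the following Python sketch computes the randomized PIT for a CDF with jumps. The Bernoulli forecast and all function names are hypothetical choices for illustration; the only ingredient taken from the text is the formula $F(X-) + V\{F(X) - F(X-)\}$.

```python
import random


def randomized_pit(cdf, cdf_left, x, v):
    """Z = F(x-) + v * (F(x) - F(x-)), the PIT of Definition 2.6."""
    left = cdf_left(x)
    return left + v * (cdf(x) - left)


def bernoulli_cdf(p):
    """CDF of a Bernoulli(p) distribution and its left limit F(x-)."""
    def cdf(x):
        return 0.0 if x < 0 else (1.0 - p if x < 1 else 1.0)

    def cdf_left(x):
        return 0.0 if x <= 0 else (1.0 - p if x <= 1 else 1.0)

    return cdf, cdf_left


def simulate_pits(p, n, seed=0):
    """PIT values when the outcome X really follows the forecast F = Bernoulli(p)."""
    rng = random.Random(seed)
    cdf, cdf_left = bernoulli_cdf(p)
    pits = []
    for _ in range(n):
        x = 1 if rng.random() < p else 0  # X ~ F
        v = rng.random()                  # independent standard uniform V
        pits.append(randomized_pit(cdf, cdf_left, x, v))
    return pits
```

Without the randomization, the PIT of a discrete forecast could only take the values $1 - p$ and $1$; with it, the simulated values should look uniform on $[0,1]$ when $X \sim F$.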


Definition 2.7 (cross-calibration). Let $F_{1,t}, \dots, F_{k,t}$ be forecasters in a prediction space for serial dependence. Let $\{i_1, \dots, i_m\} \subset \{1, \dots, k\}$.

1. The forecast $F_{1,t}$ is cross-calibrated with respect to $F_{i_1,t}, \dots, F_{i_m,t}$ if

$$\mathcal{L}(Z_{1,t} \mid F_{i_1,t}, \dots, F_{i_m,t}, \mathcal{T}_t) = \mathcal{U}([0,1]), \quad \text{almost surely,}$$

where $\mathcal{U}([0,1])$ denotes a uniform distribution on $[0,1]$.$^2$

2. For $1 \le j \le k$, $F_{1,t}$ is marginally cross-calibrated with respect to $F_{j,t}$ if

$$\mathbb{E}_{\mathbb{Q}} F_{j,t}(y) = \mathbb{E}_{\mathbb{Q}} \mathbf{1}\{F_{j,t}^{-1}(Z_{1,t}) \le y\}, \quad \text{for all } y \in \mathbb{R}.$$

$^2$ To be precise, the left hand side is a Markov kernel $\kappa : \Omega \times \mathcal{B}(\mathbb{R}) \to [0,1]$, which is required to be constant in $\omega \in \Omega$ and equal to the Lebesgue measure on $[0,1]$.


For brevity, we sometimes speak of cross-calibration with respect to $\{i_1, \dots, i_m\}$ instead of $F_{i_1,t}, \dots, F_{i_m,t}$. Our definitions are natural generalizations of the notions of calibration for individual forecasters in Gneiting and Ranjan (2013, Definition 2.6), which we recall here for ease of comparison.

Definition 2.8 (calibration). Let F be a forecaster in a one-period prediction space. 

1. The forecast $F$ is probabilistically calibrated if its PIT $Z_F$ is uniformly distributed on $[0,1]$.

2. The forecast $F$ is marginally calibrated if $\mathbb{E}_{\mathbb{Q}} F(y) = \mathbb{Q}(Y \le y)$ for all $y \in \mathbb{R}$.

In part 2 of Definition 2.8 the left-hand side of the equation depends only on the distribution of the forecast, whereas the right-hand side depends only on the distribution of the observation. Marginal calibration therefore assesses whether the average forecast distribution is equal to the marginal distribution of $Y$. If $F_{1,t}$ is marginally cross-calibrated with respect to $F_{j,t}$, then, on average, the PIT $Z_{1,t}$ of $F_{1,t}$ behaves like a standard uniform random variable when considered in view of $F_{j,t}$. Intuitively, this means that $F_{1,t}$ has enough information about $F_{j,t}$ and the observation $Y_{t+1}$ to disguise itself as uniform on average when viewed through the eyes of $F_{j,t}$.

Probabilistic cross-calibration means that the PIT $Z_{1,t}$ of $F_{1,t}$ is uniformly distributed no matter what the other forecasters predict. In contrast, probabilistic calibration of $F$ means that $Z_F$ is uniformly distributed on average over all possible predictions of the other forecasters, which is a weaker notion.


Remark. Gneiting and Ranjan (2013) also formalize the concept of dispersion in their Definition 2.6 in terms of the variance of $Z_F$. It is possible to define a notion of cross-dispersion for multiple forecasters considering the conditional variance of $Z_{1,t}$ given $F_{2,t}, \dots, F_{k,t}$. However, we feel that formulating dispersion in terms of the variance of $Z_F$ is not as natural as it seems at first sight. If $F$ is probabilistically calibrated then $Z_F$ is uniformly distributed on $[0,1]$, therefore its variance is $1/12$ and $F$ is called neutrally dispersed. In this case, $\Phi^{-1}(Z_F)$ would have a standard normal distribution, where $\Phi^{-1}$ denotes the quantile function of the standard normal distribution. It is equally intuitive to define dispersion in terms of the variance of $\Phi^{-1}(Z_F)$ with over- and underdispersion if this variance is smaller or larger than one, respectively. If a random variable $X$ with values in $[0,1]$ has variance $1/12$, generally, it does not follow that $\Phi^{-1}(X)$ has unit variance. For example, let $X$ be a beta distributed random variable with parameters $\alpha = 1$ and $\beta = (\sqrt{33} - 5)/2$. Then $X$ has variance $1/12 \approx 0.083$. The variance of $\Phi^{-1}(X)$ is approximately $1.92 \ne 1$. Therefore, it may well be that a forecast is neutrally dispersed with respect to $Z_F$ but over- or underdispersed with respect to $\Phi^{-1}(Z_F)$. Due to this ambiguity, we do not consider the concept of (cross-)dispersion in this manuscript.
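The beta example in the remark can be reproduced with a short calculation. The Python sketch below (standard library only; function names are hypothetical) evaluates the variance of a Beta$(1, \beta)$ variable in closed form and approximates the variance of $\Phi^{-1}(X)$ by numerical integration, using the fact that $Z = \Phi^{-1}(X)$ has density $\beta\{1 - \Phi(z)\}^{\beta - 1}\varphi(z)$ when $X \sim \text{Beta}(1, \beta)$. The integration limits and step count are illustrative assumptions.

```python
import math

BETA = (math.sqrt(33.0) - 5.0) / 2.0  # second beta parameter; the first is alpha = 1


def beta_variance(a, b):
    """Variance of a Beta(a, b) random variable."""
    return a * b / ((a + b) ** 2 * (a + b + 1.0))


def normal_scale_variance(b, lo=-10.0, hi=20.0, steps=60000):
    """Variance of Phi^{-1}(X) for X ~ Beta(1, b), by the trapezoidal rule."""
    h = (hi - lo) / steps
    m0 = m1 = m2 = 0.0
    for k in range(steps + 1):
        z = lo + k * h
        upper_tail = 0.5 * math.erfc(z / math.sqrt(2.0))  # 1 - Phi(z), stable for large z
        phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        dens = b * upper_tail ** (b - 1.0) * phi
        weight = 0.5 if k in (0, steps) else 1.0
        m0 += weight * dens
        m1 += weight * dens * z
        m2 += weight * dens * z * z
    m0, m1, m2 = m0 * h, m1 * h, m2 * h
    mean = m1 / m0  # normalize by the numerical mass to reduce quadrature error
    return m2 / m0 - mean * mean
```

Here `beta_variance(1.0, BETA)` equals $1/12$ up to rounding, while `normal_scale_variance(BETA)` is clearly different from one, in line with the value of approximately 1.92 reported in the remark.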


The following theorem formally connects Definitions 2.7 and 2.8, showing that the former is indeed a generalization of the latter.
Theorem 2.9. Consider forecasters $F_{1,t}, \dots, F_{k,t}$ in a prediction space for serial dependence.

1. The forecast $F_{1,t}$ is marginally cross-calibrated with respect to itself, if and only if $F_{1,t}$ is marginally calibrated.

2. If $F_{1,t}$ is cross-calibrated with respect to $F_{i_1,t}, \dots, F_{i_m,t}$, then $F_{1,t}$ is cross-calibrated with respect to any subset of $\{i_1, \dots, i_m\}$. In particular, $F_{1,t}$ is cross-calibrated with respect to the empty set $\emptyset$, that is, probabilistically calibrated.

3. If $F_{1,t}$ is cross-calibrated with respect to $F_{2,t}$, then it is also marginally cross-calibrated with respect to $F_{2,t}$.

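Proof. To show the first claim, observe that we have for all $y \in \mathbb{R}$,

$$\begin{aligned}
\mathbb{E}_{\mathbb{Q}} \mathbf{1}\{F_{1,t}^{-1}(Z_{1,t}) \le y\} &= \mathbb{Q}[F_{1,t}(Y_{t+1}-) + V\{F_{1,t}(Y_{t+1}) - F_{1,t}(Y_{t+1}-)\} \le F_{1,t}(y)] \\
&= \mathbb{Q}\{F_{1,t}(Y_{t+1}) \le F_{1,t}(y)\} \\
&= \mathbb{Q}(Y_{t+1} \le y).
\end{aligned}$$

The second equality holds, because

$$Z_{1,t} = F_{1,t}(Y_{t+1}-) + V\{F_{1,t}(Y_{t+1}) - F_{1,t}(Y_{t+1}-)\} \in [F_{1,t}(Y_{t+1}-), F_{1,t}(Y_{t+1})],$$

where the interval consists of the point $F_{1,t}(Y_{t+1}-) = F_{1,t}(Y_{t+1})$ if $F_{1,t}$ is continuous at the point $Y_{t+1}$, and $Z_{1,t} \in (F_{1,t}(Y_{t+1}-), F_{1,t}(Y_{t+1}))$ almost surely, otherwise. Furthermore, $F_{1,t}(y) \le F_{1,t}(Y_{t+1}-)$ or $F_{1,t}(y) \ge F_{1,t}(Y_{t+1})$. Let $J \subset \{i_1, \dots, i_m\}$. The second claim follows because, for $y \in (0,1)$,

$$\begin{aligned}
\mathbb{Q}\{Z_{1,t} \le y \mid F_{i,t}, i \in J, \mathcal{T}_t\} &= \mathbb{E}_{\mathbb{Q}}\{\mathbb{Q}(Z_{1,t} \le y \mid F_{i_1,t}, \dots, F_{i_m,t}, \mathcal{T}_t) \mid F_{i,t}, i \in J, \mathcal{T}_t\} \\
&= \mathbb{E}_{\mathbb{Q}}(y \mid F_{i,t}, i \in J, \mathcal{T}_t) = y
\end{aligned}$$

by the definition of cross-calibration. The last claim holds because

$$\mathbb{E}_{\mathbb{Q}} \mathbf{1}\{F_{2,t}^{-1}(Z_{1,t}) \le y\} = \mathbb{E}_{\mathbb{Q}} \mathbb{Q}\{Z_{1,t} \le F_{2,t}(y) \mid F_{2,t}, \mathcal{T}_t\} = \mathbb{E}_{\mathbb{Q}} F_{2,t}(y).$$

□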


It is possible that a forecaster is marginally calibrated but not probabilistically calibrated; see Gneiting and Ranjan (2013, Example 2.4) which we take up below in Example 3.1 to illustrate cross-calibration. Conversely, the last claim of Theorem 2.9 shows that marginal cross-calibration with respect to a different forecaster is a necessary condition for cross-calibration.


Tsyplakov (2011, 2013) introduced a slightly more restrictive notion than an ideal forecaster, which is an auto-calibrated forecaster, that is, it fulfils $\mathcal{L}(Y \mid F) = F$, almost surely, in the one-period prediction space setting. Generally, an auto-calibrated forecaster is ideal with respect to $\sigma(F)$, which is the $\sigma$-algebra generated by $F$. Gneiting and Ranjan (2013) contend that it is unlikely that empirical tests of auto-calibration are feasible, except for very special circumstances such as forecasts for binary random variables. In cases where forecasters are restricted to specific classes of distributions, Held et al. (2010) have taken on the challenge to derive statistical tests for ideal forecasters in the sense of auto-calibration based on a score regression approach; for earlier work in this direction see Hamill (2001); Mason et al. (2007). In Section 5.3 we show that it is possible to extend the score regression approach of Held et al. (2010) to test for cross-calibrated forecasters, that is, for cross-ideal forecasters with respect to $\sigma(F_1), \dots, \sigma(F_k)$; compare Proposition 2.10.

In this paper, we challenge the statement of Gneiting and Ranjan (2013) by proposing two powerful tests for cross-calibration under very general assumptions that are justified even under serial dependence; see Sections 5.1 and 5.2. Note that the following Proposition 2.10 shows that auto-calibration is in fact a special case of cross-calibration.


Proposition 2.10. Consider forecasters $F_{1,t}, \dots, F_{k,t}$ in a prediction space for serial dependence. Let $\{i_1, \dots, i_m\} \subset \{1, \dots, k\}$. Then, the following are equivalent:

1. The forecaster $F_{1,t}$ is cross-calibrated with respect to $F_{i_1,t}, \dots, F_{i_m,t}$.

2. For all $z \in [0,1)$, conditional on $F_{i_1,t}, \dots, F_{i_m,t}, \mathcal{T}_t$, the random variable $\mathbf{1}\{Z_{1,t} \le z\}$ is Bernoulli distributed with parameter $z$.

If $1 \in \{i_1, \dots, i_m\}$, then parts one and two are equivalent to $F_{1,t}$ being cross-ideal with respect to $\sigma(F_{i_1,t}), \dots, \sigma(F_{i_m,t})$.

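Proof. The equivalence of parts one and two is immediate from the definition of cross-calibration. Suppose now that $1 \in \{i_1, \dots, i_m\}$. For all $y \in \mathbb{R}$, we obtain

$$\begin{aligned}
\mathbb{Q}(Y_{t+1} \le y \mid F_{i_1,t}, \dots, F_{i_m,t}, \mathcal{T}_t) &= \mathbb{Q}\{F_{1,t}^{-1}(Z_{1,t}) \le y \mid F_{i_1,t}, \dots, F_{i_m,t}, \mathcal{T}_t\} \\
&= \mathbb{Q}\{Z_{1,t} \le F_{1,t}(y) \mid F_{i_1,t}, \dots, F_{i_m,t}, \mathcal{T}_t\} = F_{1,t}(y),
\end{aligned}$$

which shows the last claim.

□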


We conclude this section with the announced generalization of the result of Diebold et al. (1998) on uniformity and independence of PIT values in a prediction space for serial dependence.


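Theorem 2.11. Suppose we are in the prediction space setting for serial dependence. Let $\{i_1, \dots, i_m\} \subset \{1, \dots, k\}$ and assume that $F_{1,t} = \mathcal{L}(Y_{t+1} \mid \mathcal{A}_{i_1,t}, \dots, \mathcal{A}_{i_m,t}, \mathcal{T}_t)$ for all $t \in \mathbb{N}$. Then, for all $l \in \mathbb{N}_0$, we have

$$\mathcal{L}(Z_{1,t}, \dots, Z_{1,t+l} \mid \mathcal{A}_{i_1,t+l}, \dots, \mathcal{A}_{i_m,t+l}) = \mathcal{U}([0,1])^{\otimes(l+1)}, \quad \text{almost surely,}$$

for all $t \in \mathbb{N}$. Here, $\mathcal{U}([0,1])^{\otimes(l+1)}$ denotes the distribution of $l + 1$ independent standard uniform random variables.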

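Proof. We define $\mathcal{B}_t := \sigma(\mathcal{A}_{i_1,t}, \dots, \mathcal{A}_{i_m,t})$. For $u = (u_0, \dots, u_l) \in (0,1)^{l+1}$, we obtain

$$\begin{aligned}
&\mathbb{E}\{\mathbf{1}(Z_{1,t} \le u_0) \cdots \mathbf{1}(Z_{1,t+l} \le u_l) \mid \mathcal{B}_{t+l}\} \\
&\quad= \mathbb{E}[\mathbf{1}(Z_{1,t} \le u_0) \cdots \mathbf{1}(Z_{1,t+l-1} \le u_{l-1})\, \mathbb{E}\{\mathbf{1}(Z_{1,t+l} \le u_l) \mid \mathcal{B}_{t+l}, \mathcal{T}_{t+l}\} \mid \mathcal{B}_{t+l}] \\
&\quad= \mathbb{E}[\mathbf{1}(Z_{1,t} \le u_0) \cdots \mathbf{1}(Z_{1,t+l-2} \le u_{l-2})\, \mathbb{E}\{\mathbf{1}(Z_{1,t+l-1} \le u_{l-1}) \mid \mathcal{B}_{t+l}, \mathcal{T}_{t+l-1}\} \mid \mathcal{B}_{t+l}]\, u_l \\
&\quad= \mathbb{E}[\mathbf{1}(Z_{1,t} \le u_0) \cdots \mathbf{1}(Z_{1,t+l-2} \le u_{l-2})\, \mathbb{E}\{\mathbf{1}(Z_{1,t+l-1} \le u_{l-1}) \mid \mathcal{B}_{t+l-1}, \mathcal{T}_{t+l-1}\} \mid \mathcal{B}_{t+l}]\, u_l \\
&\quad= \mathbb{E}[\mathbf{1}(Z_{1,t} \le u_0) \cdots \mathbf{1}(Z_{1,t+l-2} \le u_{l-2}) \mid \mathcal{B}_{t+l}]\, u_{l-1} u_l = \cdots = u_0 \cdots u_l,
\end{aligned}$$

where we used condition (1) to obtain the third equality, and then proceeded iteratively.

□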
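The statement of Theorem 2.11 can be illustrated with a toy time series. The Python sketch below assumes a Gaussian AR(1) process $Y_{t+1} = \phi Y_t + \varepsilon_{t+1}$, which is an assumption made here for illustration and does not appear in the paper. The forecaster $\mathcal{N}(\phi Y_t, 1)$ is ideal given the past observations, so its PIT values should be uniform and serially uncorrelated. For contrast, the unconditional forecast $\mathcal{N}(0, 1/(1 - \phi^2))$ ignores the past and produces PIT values that are marginally uniform but serially dependent. All function names are hypothetical.

```python
import random
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()


def simulate_ar1_pits(n, phi=0.7, seed=0):
    """Return PIT sequences of the ideal and the unconditional forecaster."""
    rng = random.Random(seed)
    stationary_sd = 1.0 / (1.0 - phi * phi) ** 0.5
    y = rng.gauss(0.0, stationary_sd)  # start in the stationary distribution
    ideal, unconditional = [], []
    for _ in range(n):
        y_next = phi * y + rng.gauss(0.0, 1.0)
        ideal.append(STD_NORMAL.cdf(y_next - phi * y))              # PIT of N(phi * y, 1)
        unconditional.append(STD_NORMAL.cdf(y_next / stationary_sd))  # PIT of N(0, 1/(1 - phi^2))
        y = y_next
    return ideal, unconditional


def lag1_autocorrelation(values):
    """Sample autocorrelation at lag one."""
    m = mean(values)
    num = sum((a - m) * (b - m) for a, b in zip(values, values[1:]))
    den = sum((a - m) ** 2 for a in values)
    return num / den
```

Comparing `lag1_autocorrelation` for the two PIT sequences shows the difference: a value near zero for the ideal forecaster and a clearly positive value for the unconditional one.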

Remark. If we consider $q$-step ahead forecasts for some $q \ge 2$, then the above result continues to hold for all vectors of the form

$$(Z_{1,t}, Z_{1,t+q}, \dots, Z_{1,t+mq}).$$

However, there may be dependence amongst $(Z_{1,t}, Z_{1,t+1}, \dots, Z_{1,t+q-1})$, which complicates matters when testing for cross-calibration. This problem also arises in tests for uniformity and independence of PIT values as suggested by Diebold et al. (1998). Several approaches to deal with this issue have been suggested in the literature; see Knüppel (2015) and references therein. In this paper, we restrict our attention to cross-calibration of one-period ahead forecasts but extensions to $q$-step ahead forecasts would certainly be of great interest.


3 Diagnostic plots for assessing cross-calibration 

Gneiting et al. (2007) suggest assessing marginal calibration based on a plot of the empirical analogue of the difference

$$\mathbb{E}_{\mathbb{Q}}\{F_t(y)\} - \mathbb{Q}(Y_{t+1} \le y), \quad \text{for } y \in \mathbb{R}.$$

Analogously, to assess marginal cross-calibration, the empirical version of

$$\mathbb{E}_{\mathbb{Q}} F_{j,t}(y) - \mathbb{E}_{\mathbb{Q}} \mathbf{1}\{F_{j,t}^{-1}(Z_{i,t}) \le y\}, \quad \text{for } y \in \mathbb{R}, \tag{4}$$

can be plotted. If the graph is not equal to zero everywhere one can deduce that $F_{i,t}$ is not marginally cross-calibrated with respect to $F_{j,t}$ and therefore also not cross-calibrated with respect to $\{j\}$ by Theorem 2.9. If the graph is zero everywhere, then we have marginal cross-calibration. However, this does not necessarily imply that we have a cross-calibrated forecaster.

Probabilistic calibration is often checked empirically by plotting a histogram of $Z_{i,t}$, the so-called PIT-histogram. Generally, it is not obvious how to check cross-calibration empirically. However, in many situations of practical interest it can be done by borrowing the idea of considering forecasting profiles as in the cross-calibration test of Feinberg and Stewart (2008). Suppose that the forecasters $F_1, \dots, F_k$ pick predictions from some parametric class of distributions $\mathcal{F} = \{F_\lambda \mid \lambda \in \Lambda\}$, where $\Lambda \subset \mathbb{R}^d$. Then we can identify each forecaster $F_{i,t}$ with the parameter $\lambda_{i,t}$ she predicts. We observe a sample $(F_{1,t}, \dots, F_{k,t}, Y_{t+1}, V_t)$ for $1 \le t \le N$. Let $\Lambda_1, \dots, \Lambda_p$ be a partition of the parameter space. For a diagnostic plot showing if $F_{1,t}$, say, is cross-calibrated with respect to $\{i_1, \dots, i_m\}$, we can sort the observations into $p^m$ bins according to the predicted values $(\lambda_{i_1,t}, \dots, \lambda_{i_m,t})$. Then a PIT-histogram of $Z_{1,t}$ can be plotted for each bin. Clearly, the number of bins needs to be small in relation to the number of observations.

We illustrate these diagnostic tools with two examples. The first one has been proposed by Gneiting and Ranjan (2013, Example 2.4); see also Gneiting et al.
Forecaster        Predictive distribution                                                  Information set
Perfect           $F_1 = \mathcal{N}(\mu, 1)$                                              $\mathcal{A}_1 = \sigma(\mu)$
Climatological    $F_2 = \mathcal{N}(0, 2)$                                                $\mathcal{A}_2 = \{\emptyset, \Omega\}$
Unfocused         $F_3 = \frac{1}{2}\{\mathcal{N}(\mu, 1) + \mathcal{N}(\mu + \tau, 1)\}$  $\mathcal{A}_3 = \sigma(\mu, \tau)$
Sign-reversed     $F_4 = \mathcal{N}(-\mu, 1)$                                             $\mathcal{A}_4 = \sigma(\mu)$

Forecaster        Cross-calibration             Marginal cross-calibration wrt
                                                $F_1$    $F_2$    $F_3$    $F_4$
Perfect           wrt $F_1, F_2, F_3, F_4$      yes      yes      yes      yes
Climatological    wrt $F_2$                     no       yes      no       no
Unfocused         wrt $F_1, F_2, F_4$           yes      yes      no       yes
Sign-reversed     no                            no       no       no       yes

Table 2: Properties of the forecasters of Gneiting and Ranjan (2013, Example 2.4). Further details are given in Example 3.1. Cross-calibration with respect to (wrt) $F_2$ is equivalent to cross-calibration with respect to $\emptyset$, that is, probabilistic calibration.


Example 3.1. Let $\mu$ be standard normally distributed, which we denote by $\mu \sim \mathcal{N}(0, 1)$. Conditional on $\mu$, the outcome is $Y \sim \mathcal{N}(\mu, 1)$. Let $\tau$ take the values $1$ or $-1$ with equal probability, independent of $Y$ and $\mu$. We consider four forecasters $F_1, \dots, F_4$ of different skill, whose properties are summarized in Table 2.

It is clear that the perfect forecaster $F_1$ is cross-calibrated with respect to $F_1, F_2, F_3, F_4$. It is straightforward to check that the climatological forecaster $F_2$ is not cross-calibrated with respect to any of $F_1, F_3, F_4$, but only with respect to itself. As $F_2$ is deterministic, this corresponds to the fact that $F_2$ is ideal with respect to the trivial $\sigma$-algebra. As the sign-reversed forecaster $F_4$ is not probabilistically calibrated, it cannot be cross-calibrated. The cross-calibration of $F_3$ with respect to $F_1, F_2, F_4$ is shown in Appendix A. The statements about marginal cross-calibration in Table 2 are consequences of Theorem 2.9.

In Figure 2 the differences given at (4) are plotted. More precisely, the random variables are simulated 10'000 times and the mean is given. Recall that for all simulation examples we are using independent forecast-observation tuples for reasons of simplicity. In this example, it is easy to see that $F_1$ is superior to $F_2$ using the notion of marginal cross-calibration, which was not the case using only the calibration notions of Gneiting and Ranjan (2013, Definition 2.6); see Gneiting et al. (2007).


As an example for checking cross-calibration empirically we note that all four forecasters
are in the class of distribution functions $\mathcal{F} = \{F_\lambda \mid \lambda = (\mu, \sigma, \tau) \in \mathbb{R} \times (0, \infty) \times \{-1, 0, 1\}\}$ for
$F_\lambda = \frac{1}{2}\{\mathcal{N}(\mu, \sigma) + \mathcal{N}(\mu + \tau, \sigma)\}$. We plotted the PIT-histograms of $Z_2$ and $Z_3$ conditional on
the four bins $\mu \in I_j$ for $1 \le j \le 4$ with $I_1 = (-\infty, -0.67)$, $I_2 = [-0.67, 0)$, $I_3 = [0, 0.67)$,
$I_4 = [0.67, \infty)$. The PIT-histograms in Figure 1 confirm that $F_3$ is probabilistically cross-calibrated
with respect to $F_1, F_2, F_4$. On the other hand, $F_2$ is clearly not probabilistically
cross-calibrated with respect to any set of the other forecasters.


Figure 1: PIT-histogram plots of $F_3$ and $F_2$ conditional on the mean in the top and bottom
row, respectively.


Example 3.2 (Example 2.5 continued). Coming back to the forecasters $F_1$ and $F_2$ of
Example 2.5, Theorem 2.9 implies that $F_1$ is marginally cross-calibrated with respect to
itself and with respect to $F_2$. Furthermore, $F_2$ is marginally cross-calibrated with respect
to itself but $F_2$ is not marginally cross-calibrated with respect to $F_1$. Marginal cross-calibration
plots for this scenario using 10'000 and 100'000 simulations are given in Figure 3.
In this example, the lack of marginal cross-calibration can only be detected for an
unrealistically large number of observations.

PIT-histograms for assessing cross-calibration of $F_1$ with respect to $F_2$ and of $F_2$ with
respect to $F_1$ for 10'000 simulations are given in Figure 4. The partition of the parameter
space is chosen such that each histogram contains approximately the same number of
observations. The lack of cross-calibration of $F_2$ with respect to $F_1$ is clearly detected.


4 Binary outcomes 


In this section we consider the case when the observation $Y$ only takes two values, zero
and one. We interpret $Y = 1$ as a success and $Y = 0$ as a failure. A forecaster $F$ is
then represented by her predictive success probability $p$, such that the predictive CDF is
$F(y) = p\,\mathbb{1}\{y \ge 1\} + (1 - p)\,\mathbb{1}\{y \ge 0\}$. We identify $F$ with $p$, where $p$ is a random
variable taking values in $[0,1]$.

In the case of an individual forecaster $F$ for a binary outcome it has been shown in
Gneiting and Ranjan (2013, Theorem 2.11) that the notions of a probabilistically calibrated
forecaster $F$ and an ideal forecaster relative to the $\sigma$-algebra generated by the
predictive probability $p$ are equivalent. Furthermore, both notions coincide with the notion
of conditional calibration, that is, $Q(Y = 1 \mid p) = p$. This result carries over to the
notions of cross-calibration of multiple forecasters introduced in this paper. As the notions
of calibration are essentially only concerned with one prediction period, we have chosen to
present the results of this section in the one-period prediction space setting of Definition
2.1 for simplicity.

Figure 2: Marginal cross-calibration plots of the forecasters in Example 3.1 with 10'000
simulations (columns $F_1, \dots, F_4$). In the $i$-th row and $j$-th column the empirical version of the marginal
cross-calibration equation is plotted to assess whether $F_i$ is marginally cross-calibrated with respect to $F_j$ or not.

Figure 3: Marginal cross-calibration plots for the scenario in Example 2.5 for 10'000 simulations
indicated by the dashed lines and 100'000 simulations indicated by the continuous
lines (columns $F_1, F_2$). The $i$-th row and $j$-th column corresponds to the empirical version of the marginal
cross-calibration equation in order to deduce whether $F_i$ is marginally cross-calibrated with respect to $F_j$ or not.

Figure 4: PIT-histogram plots of $F_1$ conditional on the predictive degree of freedom $\nu$ of
$F_2$, where $I_1 = (5, 10]$, $I_2 = (10, 15]$, $I_3 = (15, 20]$ in the top row (panels $F_1(Y) \mid \nu \in I_1$, $F_1(Y) \mid \nu \in I_2$, $F_1(Y) \mid \nu \in I_3$), and PIT-histogram plots
of $F_2$ conditional on the predictive standard deviation $\sigma$ of $F_1$, where $J_1 = [0, 0.95]$, $J_2 =
(0.95, 1.1]$, $J_3 = (1.1, \infty)$ in the bottom row (panels $F_2(Y) \mid \sigma \in J_1$, $F_2(Y) \mid \sigma \in J_2$, $F_2(Y) \mid \sigma \in J_3$).

Theorem 4.1. Consider the one-period prediction space setting with binary outcome $Y$
and forecasts $F_1, \dots, F_k$ represented by their predictive success probabilities $p_1, \dots, p_k$, respectively.
Then the following statements are equivalent:

1. The forecast $p_1$ is cross-calibrated with respect to $p_2, \dots, p_k$, that is, $\mathcal{L}(Z_{p_1} \mid p_2, \dots, p_k)$
is standard uniform.

2. The forecast $p_1$ is conditionally cross-calibrated with respect to $p_1, \dots, p_k$, that is,
$Q(Y = 1 \mid p_1, \dots, p_k) = p_1$.

3. The forecast $p_1$ is ideal relative to $\sigma(p_1, \dots, p_k)$.

The proof of Theorem 4.1 parallels the proof of Gneiting and Ranjan (2013, Theorem
2.11). The following lemma gives a formula for the density function of $Z_{p_1}$ conditional on
$p_1 = x_1, \dots, p_k = x_k$.

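Lemma 4.2. The density function of $Z_{p_1}$ conditional on $p = x$ is given by

\[
u(z \mid p = x) = \frac{q(x)}{x_1}\,\mathbb{1}(1 - x_1 < z < 1) + \frac{1 - q(x)}{1 - x_1}\,\mathbb{1}(0 < z < 1 - x_1),
\]

where $p = (p_1, \dots, p_k)$, $x = (x_1, \dots, x_k) \in [0,1]^k$, $x_1 \in (0,1)$ and $q(x) = Q(Y = 1 \mid p = x)$.

Proof of Lemma 4.2. The PIT of $p_1$ is

\[
Z_{p_1} =
\begin{cases}
(1 - p_1) + p_1 V, & \text{if } Y = 1, \\
(1 - p_1) V, & \text{if } Y = 0.
\end{cases}
\]

Let $0 < z < 1$, then

\begin{align*}
Q(Z_{p_1} \le z \mid p = x)
&= Q\{(1 - p_1) + p_1 V \le z,\ Y = 1 \mid p = x\} + Q\{(1 - p_1) V \le z,\ Y = 0 \mid p = x\} \\
&= Q\{(1 - p_1) + p_1 V \le z \mid Y = 1, p = x\}\, Q(Y = 1 \mid p = x) \\
&\quad + Q\{(1 - p_1) V \le z \mid Y = 0, p = x\}\, Q(Y = 0 \mid p = x) \\
&= \frac{z + x_1 - 1}{x_1}\, q(x)\, \mathbb{1}(1 - x_1 < z) + \{1 - q(x)\}\, \mathbb{1}(1 - x_1 < z) \\
&\quad + \frac{z}{1 - x_1}\, \{1 - q(x)\}\, \mathbb{1}(1 - x_1 > z) \\
&= \frac{1 - q(x)}{1 - x_1}\, z\, \mathbb{1}(1 - x_1 > z) + \left\{1 - \frac{q(x)}{x_1} + \frac{q(x)\, z}{x_1}\right\} \mathbb{1}(1 - x_1 < z).
\end{align*}

Differentiation yields the claim. □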
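To make the role of Lemma 4.2 concrete, the piecewise constant density can be written as a small function. This is only an illustrative sketch with hypothetical function names; it evaluates the formula of the lemma and its total mass. The single point $z = 1 - x_1$, which has probability zero, is assigned to the lower branch by convention.

```python
def pit_density(z, x1, q):
    """Density u(z | p = x) of Lemma 4.2 for 0 < z < 1 and 0 < x1 < 1.

    x1 is the value of the forecast probability p_1 and q = Q(Y = 1 | p = x).
    """
    if not (0.0 < x1 < 1.0 and 0.0 < z < 1.0):
        raise ValueError("need 0 < x1 < 1 and 0 < z < 1")
    if z > 1.0 - x1:
        return q / x1                  # part of (0, 1) reached when Y = 1
    return (1.0 - q) / (1.0 - x1)      # part of (0, 1) reached when Y = 0


def total_mass(x1, q):
    """Integral of the density over (0, 1): two rectangles."""
    lower = (1.0 - x1) * pit_density(0.5 * (1.0 - x1), x1, q)
    upper = x1 * pit_density(1.0 - 0.5 * x1, x1, q)
    return lower + upper


print(pit_density(0.2, 0.3, 0.3), pit_density(0.9, 0.3, 0.3))
print(pit_density(0.2, 0.3, 0.6), pit_density(0.9, 0.3, 0.6))
```

If $q(x) = x_1$, both branches equal one, so the conditional law of $Z_{p_1}$ is standard uniform; for $q(x) \neq x_1$ the density has a jump at $1 - x_1$, which is what the proof of Theorem 4.1 exploits.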


Proof of Theorem 4.1. It is easy to see that part two is equivalent to part three. By
Theorem 2.9, part three implies part one. The remaining task is to prove that part one
implies part two. Let $H = p(Q)$ be the marginal law of the random vector $p$ under $Q$.
Recall that $q(x) = Q(Y = 1 \mid p = x)$. If $H(\{0\} \times [0,1]^{k-1}) > 0$, then $q(0, x_2, \dots, x_k) = 0$
for all $(x_2, \dots, x_k) \in [0,1]^{k-1}$, because

\[
H(\{0\} \times [0,1]^{k-1}) = Q\{p^{-1}(\{0\} \times [0,1]^{k-1})\} = Q\{p_1^{-1}(0)\} = Q(p_1 = 0),
\]

and furthermore,

\begin{align*}
Q(Z_{p_1} = 1 \mid p_2 = x_2, \dots, p_k = x_k)
&\ge Q(Z_{p_1} = 1,\ Y = 1,\ p_1 = 0 \mid p_2 = x_2, \dots, p_k = x_k) \\
&= Q(Y = 1,\ p_1 = 0 \mid p_2 = x_2, \dots, p_k = x_k) \\
&= Q(Y = 1 \mid p_1 = 0, p_2 = x_2, \dots, p_k = x_k)\, Q(p_1 = 0) \\
&= q(0, x_2, \dots, x_k)\, Q(p_1 = 0).
\end{align*}

We know that $Q(Z_{p_1} = 1 \mid p_2 = x_2, \dots, p_k = x_k) = 0$, because $\mathcal{L}(Z_{p_1} \mid p_2, \dots, p_k)$ is standard
uniform. This implies that $q(0, x_2, \dots, x_k) = 0$. Similarly one can show that $H(\{1\} \times
[0,1]^{k-1}) > 0$ implies $q(1, x_2, \dots, x_k) = 1$ for all $(x_2, \dots, x_k) \in [0,1]^{k-1}$.

Using that $\mathcal{L}(Z_{p_1} \mid p_2, \dots, p_k)$ is a standard uniform distribution and Lemma 4.2 we
have for a.a. $z \in (0,1)$, $\delta \in (0, 1 - z)$

\begin{align*}
0 &= u(z + \delta \mid p_2 = x_2, \dots, p_k = x_k) - u(z \mid p_2 = x_2, \dots, p_k = x_k) \\
&= \int_{[0,1]} \{u(z + \delta \mid p = x) - u(z \mid p = x)\}\, dH_1(x_1) \\
&= \int_{[0,1]} \mathbb{1}_{[1 - z - \delta,\, 1 - z)}(x_1)\, \frac{q(x) - x_1}{x_1 (1 - x_1)}\, dH_1(x_1),
\end{align*}

where $H_1 = p_1(Q)$ is the marginal law of $p_1$ under $Q$. We define the signed measure $\mu$ for
a given $(x_2, \dots, x_k) \in [0,1]^{k-1}$ as

\[
\mu(A) = \int_A \frac{q(x) - x_1}{x_1 (1 - x_1)}\, dH_1(x_1),
\]

for all Borel sets $A \subset [a, b]$, where $0 < a < b < 1$. For $[c, d) \subset [a, b]$ we have shown before
that

\[
\mu([c, d)) = \int_{[c, d)} \frac{q(x) - x_1}{x_1 (1 - x_1)}\, dH_1(x_1) = 0.
\]

Therefore, $\mu(B) = 0$ for all $B \in \mathcal{B}([a, b])$. In particular, $\{x_1 \in [a, b] \mid q(x) > x_1\}$ and
$\{x_1 \in [a, b] \mid q(x) < x_1\}$ are $H_1$ null sets and we have $q(x) = x_1$ $H_1$-a.s., hence,

\begin{align*}
p_1^{-1}(\{x_1 : q(x) = x_1\}) &= \{\omega : q(p_1(\omega), x_2, \dots, x_k) = p_1(\omega)\} \\
&= \{Q(Y = 1 \mid p_1, p_2 = x_2, \dots, p_k = x_k) = p_1\}
\end{align*}

has $Q$-measure 1 for all $(x_2, \dots, x_k) \in [0,1]^{k-1}$. Therefore, $Q(Y = 1 \mid p) = p_1$ $Q$-a.s. □

The cross-calibration notion of Feinberg and Stewart (2008) is analogous to our notion
of cross-calibration with respect to $\{1, \dots, k\}$, which is equivalent to cross-ideal forecasters
for binary events. Theorem 4.1 shows that both notions coincide with cross-calibration of
$p_1$ with respect to $\{2, \dots, k\}$, which is a priori a weaker requirement. As noted by Gneiting
and Ranjan (2013), the fact that probabilistically calibrated forecasters are automatically
ideal clarifies the relation between PIT-histograms and calibration curves, which are the
diagnostic tool frequently used for assessing calibration of binary predictions (Dawid, 1986;
Murphy and Winkler, 1992; Ranjan and Gneiting, 2010). As described in Section 3, cross-calibration
can be assessed with conditional PIT-histograms. Analogously, in the case of
binary forecasts, conditional calibration curves can be considered.
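A conditional calibration curve can be sketched as follows. The setup is a hypothetical toy example of our own and not one of the examples of this paper: $a$ and $b$ are independent standard uniform, $Q(Y = 1 \mid a, b) = (a + b)/2$, the first forecaster issues $p_1 = (a + b)/2$ and the second issues $p_2 = a/2 + 1/4 = Q(Y = 1 \mid a)$. The forecast $p_2$ is ideal relative to $\sigma(a)$ and hence calibrated, but $Q(Y = 1 \mid p_1, p_2) = p_1 \neq p_2$, so by Theorem 4.1 it is not cross-calibrated with respect to $p_1$. All function names and the binning are assumptions made for illustration.

```python
import random


def calibration_curves(n, seed=3, n_bins=5):
    """Calibration curves of p_2, overall and conditional on p_1 <= 0.5 or p_1 > 0.5.

    Returns a dict mapping (group, bin index) to the pair
    (mean forecast p_2 in the cell, observed frequency of Y = 1 in the cell).
    """
    rng = random.Random(seed)
    cells = {}  # (group, bin) -> [sum of forecasts, sum of outcomes, count]
    for _ in range(n):
        a, b = rng.random(), rng.random()
        p1 = 0.5 * (a + b)
        p2 = 0.5 * a + 0.25          # takes values in [0.25, 0.75)
        y = 1 if rng.random() < p1 else 0
        k = min(int((p2 - 0.25) / 0.5 * n_bins), n_bins - 1)
        for group in ("all", "p1 > 0.5" if p1 > 0.5 else "p1 <= 0.5"):
            cell = cells.setdefault((group, k), [0.0, 0, 0])
            cell[0] += p2
            cell[1] += y
            cell[2] += 1
    return {key: (s / c, o / c) for key, (s, o, c) in sorted(cells.items())}


for (group, k), (forecast, frequency) in calibration_curves(40000).items():
    print(group, k, round(forecast, 3), round(frequency, 3))
```

In this sketch the unconditional curve should stay close to the diagonal, while the curve conditional on $p_1 > 0.5$ should lie clearly above it, most visibly for small values of $p_2$.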


5 Tests for assessing cross-calibration 

In this section we consider statistical tests for cross-calibration. The tests in Section 5.1
are based on the idea of conditional exceedance probabilities (Mason et al., 2007), whereas the
tests in Section 5.2 use a linear regression approach. Finally, the score regression approach
by Held et al. (2010) to test for ideal forecasters is reviewed and extended to a test for
cross-ideal forecasters in Section 5.3.


5.1 Conditional exceedance probabilities 

Suppose we have observations $F_{1,t}, \dots, F_{k,t}$ and $Y_{t+1}$, $1 \le t \le N$, in a prediction space for
serial dependence. We would like to test the null hypothesis that $F_{1,t}$ is cross-calibrated
with respect to $J \subset \{1, \dots, k\}$. For $z \in [0,1)$, we define $B_{z,t} := \mathbb{1}\{Z_{1,t} \le z\}$. Under the
null hypothesis, using Proposition 2.10 and Theorem 2.11, conditional on $F_{i,t}$ for $i \in J$, the
random variables $B_{z,1}, \dots, B_{z,N}$ are independent Bernoulli random variables with success
probability $z$. We stipulate the logistic regression models

\[
\operatorname{logit}\bigl[Q\{B_{z,t} = 1 \mid F_{i,t}^{-1}(z),\ i \in J\}\bigr] = \beta_{0,z} + \sum_{i \in J} \beta_{i,z}\, F_{i,t}^{-1}(z) \qquad (5)
\]

for each $z \in [0,1)$, where $\operatorname{logit}(p) = \log\{p/(1 - p)\}$ is the logit function. Using (5), the
null hypothesis is

\[
H_0 : \beta_{0,z} = \operatorname{logit}(z),\ \beta_{i,z} = 0,\ i \in J, \text{ for all } z \in [0,1). \qquad (6)
\]

For each $z \in [0,1)$, we suggest to test the pointwise hypothesis

\[
H_0(z) : \beta_{0,z} = \operatorname{logit}(z),\ \beta_{i,z} = 0,\ i \in J, \qquad (7)
\]

by a likelihood ratio test yielding a p-value $\pi(z)$.

More precisely, the covariate vector $x_{z,t}$ has one as the first entry and then $F_{i,t}^{-1}(z)$,
$i \in J$, and the parameter vector $\beta_z$ has entries $\beta_{0,z}$, $\beta_{i,z}$, $i \in J$. For values of $z$ close to
zero or one, we frequently encounter the phenomenon of separation, that is, the likelihood
converges, but at least one parameter value is infinite. Therefore, we have chosen to use
the method of Firth (1993), which always yields finite parameter estimates; see Heinze
and Schemper (2002). That is, we fit the parameters $\beta_{0,z}$, $\beta_{i,z}$, $i \in J$, by maximizing the
penalized log-likelihood function

\[
\ell_p(\beta_z) = \ell(\beta_z) + \frac{1}{2} \log |I(\beta_z)|,
\]

where

\[
\ell(\beta_z) = \sum_{t=1}^{N} B_{z,t}\, x_{z,t}^T \beta_z - \sum_{t=1}^{N} \log\{1 + \exp(x_{z,t}^T \beta_z)\},
\]

and $|I(\beta_z)|$ is the determinant of the Fisher information matrix. We denote the estimated
parameter vector by $\hat\beta_z$ with entries $\hat\beta_{0,z}$, $\hat\beta_{i,z}$, $i \in J$. For $N$ large enough, the test statistic

\[
T_z = -2\{\ell_p(\gamma_z) - \ell_p(\hat\beta_z)\}
\]

has approximately a $\chi^2$-distribution with $1 + |J|$ degrees of freedom, where $|J|$ denotes the cardinality
of $J$, and $\gamma_z = (\operatorname{logit}(z), 0, \dots, 0)^T$. We define the p-value $\pi(z) = 1 - \chi^2_{1+|J|}(T_z)$, where
$\chi^2_{1+|J|}$ denotes the cumulative distribution function of a $\chi^2$ random variable with $1 + |J|$
degrees of freedom. For the simulation studies below and the data analysis in Section 6
we have used the R-package of Heinze et al. (2013) to calculate $T_z$.
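For intuition, the pointwise test can be written out in closed form in the simplest special case. The sketch below is our own illustration under two explicit simplifications: it takes $J = \emptyset$, so that only the intercept $\beta_{0,z}$ is present, and it uses the ordinary rather than the Firth-penalized likelihood. It is therefore not a substitute for the R-package mentioned above, and the function names are hypothetical. With one degree of freedom, the $\chi^2$ survival function can be expressed through the complementary error function.

```python
import math


def xlogy(x, y):
    """x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def pointwise_lr_test(pit, z):
    """Likelihood ratio test of H_0(z) for J empty, that is beta_0 = logit(z).

    pit holds the PIT values Z_{1,t}; returns (T_z, p-value).
    Assumption: ordinary (unpenalized) likelihood, intercept-only model.
    """
    if not 0.0 < z < 1.0:
        raise ValueError("z must lie strictly between 0 and 1")
    n = len(pit)
    s = sum(1 for v in pit if v <= z)     # number of indicators B_{z,t} equal to one
    p_hat = s / n                         # maximum likelihood estimate of Q(B = 1)
    loglik_hat = xlogy(s, p_hat) + xlogy(n - s, 1.0 - p_hat)
    loglik_null = s * math.log(z) + (n - s) * math.log(1.0 - z)
    t = max(0.0, 2.0 * (loglik_hat - loglik_null))
    p_value = math.erfc(math.sqrt(t / 2.0))  # chi-square survival function, 1 df
    return t, p_value


uniform_like = [(i + 0.5) / 100 for i in range(100)]
print(pointwise_lr_test(uniform_like, 0.3))
print(pointwise_lr_test([0.1] * 50, 0.5))
```

The second call illustrates separation: all indicators equal one, the unpenalized estimate of $\beta_{0,z}$ is infinite, and only the likelihood value remains finite. This is the situation in which the text resorts to the method of Firth (1993).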

In order to draw conclusions about the global null hypothesis $H_0$ at (6) from the
pointwise p-values $\pi(z)$, we adjust them for multiple testing. We follow the approach
of Cox and Lee (2008) to use the method of Westfall and Young (1993, Chapter 2) for
functional data to compute adjusted p-values $r(z)$; see also Meinshausen et al. (2011).

Let $0 < z_1 < \dots < z_M < 1$. Under the null hypothesis of cross-calibration, it is
possible to simulate a vector of p-values $(\pi^*(z_1), \dots, \pi^*(z_M))$ with the same distribution
as $(\pi(z_1), \dots, \pi(z_M))$ conditional on $F_{i,t}^{-1}(z_m)$, $i \in J$, $1 \le t \le N$, $1 \le m \le M$, as follows.
Let $U_1, \dots, U_N$ be iid standard uniform random variables. For $1 \le m \le M$, define
$B^*_{z_m,t} = \mathbb{1}(U_t \le z_m)$, and let $\pi^*(z_m)$ be the p-value from the pointwise likelihood ratio test
for the simulated data vector $(B^*_{z_m,t})_{1 \le t \le N}$ and covariates $(x_{z_m,t})_{1 \le t \le N}$ as before.

The adjusted p-values can now be obtained as follows. Let $\sigma$ be the permutation of
$\{1, \dots, M\}$ such that $\pi(z_{\sigma(1)}) \le \dots \le \pi(z_{\sigma(M)})$. This permutation $\sigma$ remains unchanged
in the following procedure. For a simulated vector of p-values $(\pi^*(z_1), \dots, \pi^*(z_M))$, we
define $q^*_m = \min\{\pi^*(z_{\sigma(s)}) : s \ge m\}$. Repeating this procedure $L$ times, we obtain an array
$(q^*_{m,l})_{1 \le m \le M,\ 1 \le l \le L}$ and define the adjusted p-values $r_1, \dots, r_M$ corresponding to $z_1, \dots, z_M$
as

\[
r_m = \frac{1}{L} \sum_{l=1}^{L} \mathbb{1}\{q^*_{\sigma^{-1}(m),\,l} \le \pi(z_m)\}, \qquad 1 \le m \le M.
\]

The global null hypothesis $H_0$ at (6) can be rejected at level $\alpha \in (0,1)$ if $\min\{r_m : 1 \le
m \le M\} \le \alpha$. Furthermore, the adjusted p-values allow us to draw conclusions for which
values of $z_m \in (0,1)$ miscalibration occurs. For example, a prediction method may perform
satisfactorily for the left tail of the distribution, that is, for $z$ close to zero, the adjusted
p-values are large, whereas it fails to capture the right tail and hence for $z$ close to one,
the adjusted p-values are small. We call this test the CEP test with respect to $J$.
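The adjustment step can be sketched independently of how the pointwise p-values are obtained. The function below is a minimal illustration with hypothetical names: it takes the observed pointwise p-values and $L$ simulated vectors of p-values as given, and implements exactly the successive minima and the counting formula for $r_m$ stated above. No additional monotonicity enforcement is applied, since the text does not describe one.

```python
def adjusted_p_values(pi_obs, pi_sim):
    """Resampling-based adjusted p-values r_1, ..., r_M.

    pi_obs: pointwise p-values pi(z_1), ..., pi(z_M).
    pi_sim: list of L simulated vectors (pi*(z_1), ..., pi*(z_M)).
    """
    M = len(pi_obs)
    # order[j] is the grid index with the (j + 1)-th smallest observed p-value,
    # so it plays the role of the permutation sigma (with indices starting at 0)
    order = sorted(range(M), key=lambda m: pi_obs[m])
    rank = [0] * M                     # rank[m] plays the role of sigma^{-1}(m)
    for j, m in enumerate(order):
        rank[m] = j
    counts = [0] * M
    for sim in pi_sim:
        # successive minima q*_j = min{pi*(z_sigma(s)) : s >= j}
        q = [0.0] * M
        running = float("inf")
        for j in range(M - 1, -1, -1):
            running = min(running, sim[order[j]])
            q[j] = running
        for m in range(M):
            if q[rank[m]] <= pi_obs[m]:
                counts[m] += 1
    return [c / len(pi_sim) for c in counts]


pi_obs = [0.5, 0.01, 0.2]
pi_sim = [[0.3, 0.4, 0.1], [0.6, 0.005, 0.9], [0.7, 0.8, 0.9], [0.02, 0.5, 0.3]]
print(adjusted_p_values(pi_obs, pi_sim))  # [0.5, 0.25, 0.5]
```

In this toy input the smallest observed p-value, 0.01, is compared with the minimum over the whole simulated vector, which falls below 0.01 in one of the four replications, giving the adjusted value 0.25.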

Remark. It is important to note that the adjusted p-values $r_m$ remain the same if the
pointwise p-values $\pi(z)$ are transformed with a strictly monotone transformation. Therefore,
even if the $\pi(z)$ are only asymptotic p-values, the adjusted p-values $r_m$ will control
the familywise error rate at the desired level $\alpha$ even for finite samples (for large numbers
$L$ of bootstrap replications); see Westfall and Young (1993, Chapter 2) and Cox and Lee
(2008). It is nevertheless important which test statistic to choose for the pointwise tests
as the power of the overall test will crucially depend on the power of the pointwise tests.

wrt      $F_1$    $F_2$    $F_3$    $F_4$    $F_1,F_3$   $F_1,F_4$   $F_3,F_4$   $F_1,F_3,F_4$
$F_1$    0.056    0.051    0.055    0.055    0.050       0.051       0.050       0.048
$F_2$    0.997    0.051    0.979    0.997    0.994       0.990       0.993       0.977
$F_3$    0.052    0.052    0.168    0.051    0.635       0.051       0.634       0.582
$F_4$    1.000    1.000    1.000    1.000    1.000       1.000       1.000       1.000

Table 3: Monte-Carlo power for the CEP tests at significance level $\alpha = 0.05$. Details of
the simulation study are given in Example 5.1.

N = 50    $F_1$    $F_2$    $F_1,F_2$
$F_1$     0.052    0.053    0.050
$F_2$     0.156    0.051    0.139

N = 200   $F_1$    $F_2$    $F_1,F_2$
$F_1$     0.050    0.052    0.051
$F_2$     0.533    0.057    0.464

Table 4: Monte-Carlo power for the CEP tests at significance level $\alpha = 0.05$. Details of
the simulation study are given in Example 5.2.

Example 5.1 (Example 3.1 continued). We consider the forecasters $F_1, \dots, F_4$ of Example
3.1; see Table 2. For sample size $N = 50$, we performed the CEP tests for $F_1, \dots, F_4$ with
respect to all possible subsets of $F_1, \dots, F_4$ and calculated the Monte-Carlo power based
on 10'000 simulations. We used the gridpoints $z_m = \{1 + (18/19)m\}/20$, $0 \le m \le 19$. The
number of bootstrap replications for calculating the adjusted p-values is set to $L = 500$.
For data examples, $L$ should be much larger. However, for analysing the performance of
the resampling based p-values, it is more important to run a large number of simulations
than to have a large bootstrap sample for each of them; see Westfall and Young (1993)
for a more detailed discussion. The results are given in Table 3.

Conditioning on $F_2$ corresponds to conditioning on the trivial $\sigma$-algebra; therefore,
testing conditional on $F_1, F_2, F_3$ is the same as testing conditional on $F_1, F_3$, for example.
Hence, Table 3 contains all interesting subsets of $F_1, \dots, F_4$, and the column entitled '$F_2$'
corresponds to a test for probabilistic calibration. The test performs well, even for the
small sample size $N = 50$. Generally, the power of the test appears to increase the more
information is used. For example, the test has difficulty detecting that $F_3$ is not ideal with
respect to itself, but it performs well for rejecting the null hypothesis that $F_3$ is cross-ideal
with respect to $F_1, F_3$, with respect to $F_3, F_4$, or with respect to $F_1, F_3, F_4$.


Example 5.2 (Example 2.5 continued). We applied the CEP tests to data simulated
from the prediction space described in Example 2.5. We used the same grid and other
parameters as in the previous example, except that we considered two different sample
sizes $N = 50$ and $N = 200$. The results from 10'000 simulations can be seen in Table 4.
Here, the power for sample size $N = 50$ is only small. Fortunately, it appears to increase
rapidly with sample size and is satisfactory for $N = 200$.


5.2 Linear regression approach 

To formulate the linear regression approach (LRA) tests for cross-calibration, we restrict
ourselves to a parametric class of cumulative distribution functions $\mathcal{F} = \{F_\lambda \mid \lambda \in \Lambda\}$,
where $\Lambda \subset \mathbb{R}^d$. Suppose we have $N$ forecast-observation tuples $(F_{1,t}, \dots, F_{k,t}, Y_t, V_t)$, $1 \le
t \le N$, in a prediction space for serial dependence such that $F_{i,t} \in \mathcal{F}$ for all $i$ and $t$. Each
forecaster $F_{i,t}$ is then represented by its predictive parameter vector $\lambda_{i,t} = (\lambda_{i,t}^{(1)}, \dots, \lambda_{i,t}^{(d)})$
for $1 \le i \le k$. We want to test the hypothesis that $F_{1,t}$ is cross-calibrated with respect to
some $J = \{i_1, \dots, i_m\} \subset \{1, \dots, k\}$, for $1 \le t \le N$. Proposition 2.10 and Theorem 2.11
lead to the null hypothesis

\[
H_0 : \mathcal{L}\{\Phi^{-1}(Z_{1,1}), \dots, \Phi^{-1}(Z_{1,N}) \mid \lambda_{j,t},\ j \in J,\ 1 \le t \le N\} = \mathcal{N}_N(0, I_N), \text{ almost surely},
\]

where $\Phi^{-1}$ denotes the quantile function of a standard normal distribution and $\mathcal{N}_N(0, I_N)$
denotes a multivariate standard normal distribution. In order to test this hypothesis we
perform an F-test based on linear regression. We consider the linear model


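\[
\mathbf{Y} = D_J \beta + \varepsilon, \qquad (8)
\]

where

\[
\mathbf{Y} = (\Phi^{-1}(Z_{1,1}), \dots, \Phi^{-1}(Z_{1,N}))^T \in \mathbb{R}^N
\]

is the response vector,

\[
D_J =
\begin{pmatrix}
1 & \lambda_{i_1,1}^{(1)} & \cdots & \lambda_{i_1,1}^{(d)} & \lambda_{i_2,1}^{(1)} & \cdots & \lambda_{i_m,1}^{(d)} \\
1 & \lambda_{i_1,2}^{(1)} & \cdots & \lambda_{i_1,2}^{(d)} & \lambda_{i_2,2}^{(1)} & \cdots & \lambda_{i_m,2}^{(d)} \\
\vdots & \vdots & & \vdots & \vdots & & \vdots \\
1 & \lambda_{i_1,N}^{(1)} & \cdots & \lambda_{i_1,N}^{(d)} & \lambda_{i_2,N}^{(1)} & \cdots & \lambda_{i_m,N}^{(d)}
\end{pmatrix}
\in \mathbb{R}^{N \times (1 + dm)}
\]

is the design matrix,

\[
\beta = (\beta_0, \beta_1, \dots, \beta_{dm})^T \in \mathbb{R}^{1 + dm}
\]

is the parameter vector we would like to estimate and $\varepsilon \in \mathbb{R}^N$ is a random error vector,
which is multivariate standard normal under the null hypothesis.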

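In order to estimate $\beta$ the method of least squares is used and we obtain the estimated
parameter vector

\[
\hat\beta = (D_J^T D_J)^{-1} D_J^T \mathbf{Y},
\]

the vector of fitted values

\[
\hat{\mathbf{Y}} = D_J (D_J^T D_J)^{-1} D_J^T \mathbf{Y} = D_J \hat\beta,
\]

and the residual vector $\hat\varepsilon = \mathbf{Y} - \hat{\mathbf{Y}}$.

Under the null hypothesis we have that

\[
\beta = (0, 0, \dots, 0)^T \in \mathbb{R}^{1 + dm} \quad \text{and} \quad \varepsilon \sim \mathcal{N}_N(0, I_N).
\]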


To test the assumption that $\varepsilon$ is standard normal one can use a normality test such as
the Anderson-Darling or Shapiro-Wilk test (Anderson and Darling, 1954; Shapiro and Wilk,
1965; Yap and Sim, 2011). This yields a p-value $\pi_{\mathrm{normal}}$. To test the other assumption we
consider the test statistic

\[
F_0 = \frac{\hat\beta^T (D_J^T D_J) \hat\beta}{(1 + dm)\, \hat\sigma^2},
\]

where

\[
\hat\sigma^2 = \frac{\|\mathbf{Y} - \hat{\mathbf{Y}}\|^2}{N - (1 + dm)}
\]

is the unbiased variance estimator. The test statistic $F_0$ has a Fisher distribution with
$1 + dm$ and $N - 1 - dm$ degrees of freedom; see for example Montgomery et al. (2001).
The p-value $\pi_F$ is then

\[
\pi_F = 1 - F_{1 + dm,\, N - 1 - dm}(F_0),
\]

where $F_{p,q}$ denotes the Fisher cumulative distribution function with $p$ and $q$ degrees of
freedom. Combining these two tests by the method of Holm leads to the adjusted p-value

\[
\pi_{\mathrm{adjust}} = 2 \min\{\pi_F, \pi_{\mathrm{normal}}\}.
\]

We need that $\operatorname{rank}(D_J) = 1 + dm$, otherwise the regression analysis is not possible.
Therefore, any forecaster $F_{i,t}$, $i \in J$, has to predict for each parameter at least two distinct
values. Otherwise, we omit the parameter for this forecaster in the model and are still
able to use the test, which we call the LRA test with respect to $J$.

N = 20    $F_1$    $F_2$    $F_3$    $F_4$
$F_1$     0.024    0.026    0.025    0.025
$F_2$     0.884    0.025    0.825    0.884
$F_3$     0.024    0.025    0.238    0.024
$F_4$     1.000    0.880    0.999    1.000

N = 50    $F_1$    $F_2$    $F_3$    $F_4$
$F_1$     0.024    0.022    0.025    0.024
$F_2$     1.000    0.027    1.000    1.000
$F_3$     0.026    0.023    0.734    0.026
$F_4$     1.000    1.000    1.000    1.000

Table 5: Monte-Carlo power of the LRA test at significance level $\alpha = 0.05$ for different
sample sizes. Details are given in Example 5.3.
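The F-statistic of the LRA test can be illustrated in the smallest non-trivial case. The sketch below is ours and rests on a simplifying assumption: the design matrix has an intercept and a single predictive parameter per time point ($d = 1$, $m = 1$), so that the normal equations can be solved in closed form without a linear algebra library. The function name is hypothetical. The sketch uses the identity $\hat\beta^T (D_J^T D_J) \hat\beta = \|\hat{\mathbf{Y}}\|^2$. Turning $F_0$ into the p-value $\pi_F$ would additionally require the Fisher distribution function, which is not part of the Python standard library and is therefore omitted.

```python
from statistics import NormalDist


def lra_f_statistic(pit, lam):
    """F_0 for the model Phi^{-1}(Z_t) = beta_0 + beta_1 * lam_t + eps_t.

    pit: PIT values Z_{1,t}; lam: one predictive parameter per time point.
    Returns (F_0, (beta_0_hat, beta_1_hat)).
    """
    n = len(pit)
    y = [NormalDist().inv_cdf(z) for z in pit]       # response vector
    # entries of D^T D and D^T Y for the design matrix D = [1, lam]
    s1 = sum(lam)
    s2 = sum(v * v for v in lam)
    t0 = sum(y)
    t1 = sum(v * w for v, w in zip(lam, y))
    det = n * s2 - s1 * s1
    if det == 0:
        raise ValueError("design matrix does not have full rank")
    b0 = (s2 * t0 - s1 * t1) / det
    b1 = (n * t1 - s1 * t0) / det
    fitted = [b0 + b1 * v for v in lam]
    rss = sum((w - f) ** 2 for w, f in zip(y, fitted))   # ||Y - Y_hat||^2
    ess = sum(f * f for f in fitted)                     # beta^T D^T D beta
    n_par = 2                                            # 1 + dm with d = m = 1
    sigma2 = rss / (n - n_par)                           # unbiased variance estimator
    return ess / (n_par * sigma2), (b0, b1)


# toy input: responses 1.1, 0.9, 1.1, 0.9 on the normal scale
pit = [NormalDist().cdf(v) for v in (1.1, 0.9, 1.1, 0.9)]
print(lra_f_statistic(pit, [1.0, 2.0, 3.0, 4.0]))
```

The rank condition discussed above shows up here as the determinant check: if the forecaster predicts the same parameter value at every time point, the determinant vanishes and the parameter has to be dropped from the model.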


Example 5.3 (Example 3.1 continued). Recall the forecasters $F_1, F_2, F_3$ and $F_4$ from
Example 3.1. Then all four forecasters are in the class of distribution functions $\mathcal{F} =
\{F_\lambda \mid \lambda = (\mu, \sigma, \tau) \in \mathbb{R} \times (0, \infty) \times \{-1, 0, 1\}\}$ for $F_\lambda = \frac{1}{2}\{\mathcal{N}(\mu, \sigma) + \mathcal{N}(\mu + \tau, \sigma)\}$. We apply
the LRA test for all combinations of forecasters for sample sizes $N = 20$ and $N = 50$.
The Monte-Carlo powers of $\pi_{\mathrm{adjust}}$ for 10'000 simulations are given in Table 5. We only
listed the Monte-Carlo powers with respect to individual forecasters, since we omit some
parameters to have a design matrix with full rank. Therefore, testing cross-calibration
with respect to $J \subset \{1, 2, 3, 4\}$ leads to the same test as testing with respect to $F_3$ if $3 \in J$
and testing with respect to $F_1$ otherwise. For testing standard normality, we have used
an Anderson-Darling test (with mean set to zero and variance set to one). In the cases of
cross-calibration, the normality test never rejects the null hypothesis, which explains the
conservative levels of around 0.025 in these cases. The test is powerful even for the small
sample sizes and it provides the expected results from the theoretical considerations; see
Table 2. In particular, the LRA test detects well that $F_3$ is not ideal with respect to itself,
contrary to the CEP test; compare Table 3.


Example 5.4 (Example 2.5 continued). Coming back to forecasters $F_1$ and $F_2$ from
Example 2.5, we perform the F-tests for different sample sizes $N$. The Monte-Carlo powers
of the tests for 10'000 simulations can be found in Table 6. The Monte-Carlo powers are
low even for large sample sizes, contrary to the results of the CEP tests; compare Table 4.
We do not report the power of the LRA test in this example because the Anderson-Darling
test for standard normality almost never rejects the null hypothesis.

N                       20       50       100      200      1000     5000
$F_1$ wrt $F_1$         0.051    0.049    0.048    0.049    0.053    0.047
$F_1$ wrt $F_2$         0.053    0.05     0.045    0.05     0.05     0.05
$F_1$ wrt $F_1, F_2$    0.05     0.048    0.046    0.049    0.051    0.048
$F_2$ wrt $F_1$         0.092    0.105    0.114    0.122    0.139    0.135
$F_2$ wrt $F_2$         0.051    0.05     0.045    0.048    0.051    0.049
$F_2$ wrt $F_1, F_2$    0.081    0.092    0.097    0.108    0.121    0.119

Table 6: Monte-Carlo power for the F-test for different sample sizes $N$ and 10'000 simulations
in Example 5.4.


5.3 Score regression approach 


Held et al. (2010) suggest a significance test for ideal forecasters based on scoring rules.
They use the continuous ranked probability score (CRPS) (Gneiting et al., 2005) and the
Dawid-Sebastiani score (DSS) (Dawid and Sebastiani, 1999). Their approach relies on
independent forecast-observation tuples, and this restriction remains when generalizing
their approach to a test for cross-ideal forecasters. Therefore, throughout this section we
work in a one-period prediction space.

First, we recall some preliminaries on the CRPS and the DSS. Let $F$ and $f$ denote
the predictive CDF of a forecaster and the predictive density function, respectively. Let
$\mu$ and $\sigma^2$ be the predictive mean and variance, respectively.^3 The observed value of $Y$ is
denoted by $y$. The CRPS is given by

\[
\mathrm{CRPS}(F, y) = \int_{-\infty}^{\infty} \{F(x) - \mathbb{1}(y \le x)\}^2\, dx
\]

and the DSS by

\[
\mathrm{DSS}(F, y) = \frac{1}{2}\{\log(\sigma^2) + \tilde{y}^2\},
\]

where $\tilde{y} = (y - \mu)/\sigma$. For a forecaster predicting a normal distribution $F = \mathcal{N}(\mu, \sigma^2)$, the
CRPS turns out to be

\[
\mathrm{CRPS}(F, y) = \sigma \left[ \tilde{y}\{2\Phi(\tilde{y}) - 1\} + 2\varphi(\tilde{y}) - \frac{1}{\sqrt{\pi}} \right],
\]

where $\Phi$ and $\varphi$ are the CDF and the density of a standard normal distribution, respectively
(Gneiting and Raftery, 2007). The CRPS is a strictly proper scoring rule relative to the
class of probability measures with finite first moments; see Gneiting and Raftery (2007)
for details on proper and strictly proper scoring rules.

^3 In the prediction space setting, the quantities $F$, $f$, $\mu$, and $\sigma$ are random $\mathcal{A}_0$-measurable quantities.
For ease of presentation, in this section we treat them as if they were deterministic, or, alternatively, one
should consider all expectations and variances as conditional on the forecaster's information set $\mathcal{A}_0 \subset \mathcal{A}$.

For a normal prediction the DSS is the same as the classical logarithmic score $\mathrm{LS}(f, y) =
-\log\{f(y)\}$ up to a constant. The DSS is a proper scoring rule relative to the class of
probability measures with finite second moment. It is strictly proper relative to any class
of probability measures that are characterized by their first two moments, such as Gaussian
measures or other location-scale families of distributions (Gneiting and Raftery, 2007).
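The two scores for a normal forecast are easy to evaluate directly. The following sketch, with hypothetical function names, implements the closed-form CRPS and the DSS (with the factor 1/2 used in this section), checks the closed form against a midpoint-rule approximation of the integral definition, and verifies that the DSS and the logarithmic score of a normal forecast differ by the constant $\frac{1}{2}\log(2\pi)$.

```python
import math


def norm_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def norm_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def crps_normal(mu, sigma, y):
    """Closed-form CRPS of the forecast N(mu, sigma^2) at the observation y."""
    t = (y - mu) / sigma
    return sigma * (t * (2.0 * norm_cdf(t) - 1.0) + 2.0 * norm_pdf(t)
                    - 1.0 / math.sqrt(math.pi))


def dss(mu, sigma, y):
    """Dawid-Sebastiani score (1/2) * {log(sigma^2) + ((y - mu) / sigma)^2}."""
    t = (y - mu) / sigma
    return 0.5 * (math.log(sigma * sigma) + t * t)


def log_score_normal(mu, sigma, y):
    """Logarithmic score -log f(y) of the forecast N(mu, sigma^2)."""
    return -math.log(norm_pdf((y - mu) / sigma) / sigma)


def crps_numeric(mu, sigma, y, half_width=12.0, steps=24000):
    """Midpoint-rule approximation of the integral definition of the CRPS."""
    lo = min(mu, y) - half_width * sigma
    hi = max(mu, y) + half_width * sigma
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        indicator = 1.0 if y <= x else 0.0
        total += (norm_cdf((x - mu) / sigma) - indicator) ** 2
    return total * h


print(crps_normal(1.0, 2.0, 0.3), crps_numeric(1.0, 2.0, 0.3))
print(log_score_normal(1.0, 2.0, 0.3) - dss(1.0, 2.0, 0.3), 0.5 * math.log(2.0 * math.pi))
```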

We assume now that mean and variance of the predictive distribution $F$ match mean
and variance of the outcome $Y$. The following properties of the CRPS and the DSS can
be found in Held et al. (2010). For the DSS we get

\[
E\{\mathrm{DSS}(F, Y)\} = \frac{1}{2} + \log(\sigma). \qquad (10)
\]

If the distribution of $Y$ has finite fourth moment then $\operatorname{var}\{\mathrm{DSS}(F, Y)\}$ is a constant that
does not depend on $\mu$ or $\sigma$. If $Y$ has a normal distribution then $\operatorname{var}\{\mathrm{DSS}(F, Y)\} = \frac{1}{2}$.

Similar results for the CRPS are harder to obtain. Held et al. (2010) show the following
lemma.

Lemma 5.5. Let $X_0$ be a random variable with finite second moment. For $a \in \mathbb{R}$ and
$b > 0$, let $Y = a + bX_0$; let $F$ be the CDF of $Y$, and $\sigma^2$ its variance. Then,

\[
E\{\mathrm{CRPS}(F, Y)\} = d\,\sigma \quad \text{and} \quad \operatorname{var}\{\mathrm{CRPS}(F, Y)\} = D\,\sigma^2, \qquad (11)
\]

where

\[
d = \frac{E|X_0 - X_0'|}{2\sqrt{\operatorname{var}(X_0)}} \quad \text{and} \quad D = \frac{\operatorname{var}\{E(|X_0 - X_0'| \mid X_0)\}}{\operatorname{var}(X_0)}
\]

with $X_0'$ an independent copy of $X_0$.

The lemma shows that for location-scale families of distributions the expected CRPS of
an ideal forecast is proportional to the predictive standard deviation $\sigma$, and the variance
of the CRPS is proportional to the predictive variance $\sigma^2$. For the family of normal
distributions we have $d = 1/\sqrt{\pi}$ and $D = \{1/3 - (4 - \sqrt{12})/\pi\} \approx 0.16275$. The constants
for other families can be calculated at least numerically.
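As an illustration of such a numerical calculation, the sketch below approximates $d$ and $D$ for the standard normal case by a midpoint rule and can be compared with the closed-form values quoted above. It uses the known expression $E|x - X_0'| = x\{2\Phi(x) - 1\} + 2\varphi(x)$ for a standard normal $X_0'$; the function name, the integration range and the step count are our own choices. For another location-scale family one would replace the density and the conditional expectation accordingly.

```python
import math


def norm_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def norm_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def crps_constants_normal(half_width=10.0, steps=20000):
    """Midpoint-rule approximations of d and D in Lemma 5.5 for X_0 ~ N(0, 1)."""
    h = 2.0 * half_width / steps
    first = 0.0    # approximates E{g(X_0)} = E|X_0 - X_0'|
    second = 0.0   # approximates E{g(X_0)^2}
    for i in range(steps):
        x = -half_width + (i + 0.5) * h
        # g(x) = E(|X_0 - X_0'| | X_0 = x) for a standard normal X_0'
        g = x * (2.0 * norm_cdf(x) - 1.0) + 2.0 * norm_pdf(x)
        w = norm_pdf(x) * h
        first += g * w
        second += g * g * w
    d = first / 2.0              # E|X_0 - X_0'| / (2 * sd) with sd = 1
    D = second - first ** 2      # var{E(|X_0 - X_0'| | X_0)} / var(X_0)
    return d, D


d, D = crps_constants_normal()
print(d, 1.0 / math.sqrt(math.pi))
print(D, 1.0 / 3.0 - (4.0 - math.sqrt(12.0)) / math.pi)
```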

For the score regression approach, we consider $N$ independent and identically distributed
observations $(F_{1,n}, \dots, F_{k,n}, Y_n, V_n)$, $1 \le n \le N$, in the prediction space setting;
see Definition 2.1. The expectation of the DSS depends on the logarithm of the predictive
standard deviation; see equation (10). Therefore, we stipulate a regression model of the
form

\[
\mathrm{DSS}(F_{1,n}, Y_n) = a + b_1 \log(\sigma_{1,n}) + \dots + b_k \log(\sigma_{k,n}) + \varepsilon_n,
\]

where $\sigma_{i,n}$ is the predictive standard deviation of $F_{i,n}$ and $\varepsilon_n$ is an independent error with
mean zero. Since the variance of the DSS is constant, irrespective of the predictive variance,
we can use a homoscedastic regression model to compute the least squares estimators
$\hat{a}, \hat{b}_1, \dots, \hat{b}_k$. In the case $k = 1$ this is the same model as proposed in Held et al. (2010,
eq. (7)). We need to assume that the scores have finite variance, which is fulfilled if $Y$ has
a finite fourth moment (conditional on $\mathcal{A}_1$).

For the CRPS, motivated by (11), we stipulate the regression model

\[
\mathrm{CRPS}(F_{1,n}, Y_n) = c + d_1 \sigma_{1,n} + \dots + d_k \sigma_{k,n} + \varepsilon_n.
\]

We have $\operatorname{var}\{\mathrm{CRPS}(F_{1,n}, Y_n)\} \propto \sigma_{1,n}^2$ and use a weighted regression analysis with weights
$1/\sigma_{1,n}^2$ to obtain estimators $\hat{c}, \hat{d}_1, \dots, \hat{d}_k$; see for example Montgomery et al. (2001).

Both of these models can be used for testing if the forecaster $F_1$ is cross-ideal with
respect to $\mathcal{A}_1 = \sigma(F_1), \dots, \mathcal{A}_k = \sigma(F_k)$ in case of a normal forecaster $F_1$. The DSS can
also be used if the prediction is non-normal as emphasized in Held et al. (2010). The
CRPS model is useful for any location-scale family of distributions.

We have the null hypotheses

\[
H_0 : a = 1/2,\ b_1 = 1 \text{ and } b_2 = \dots = b_k = 0 \quad \text{for DSS},
\]
\[
H_0 : c = 0,\ d_1 = 1/\sqrt{\pi} \text{ and } d_2 = \dots = d_k = 0 \quad \text{for CRPS},
\]

and perform a $\chi^2$-test. We use the test statistics

\[
T_{\mathrm{DSS}} = (\hat{a} - 1/2,\ \hat{b}_1 - 1,\ \hat{b}_2, \dots, \hat{b}_k)\, \hat\Sigma_{\mathrm{DSS}}^{-1}\, (\hat{a} - 1/2,\ \hat{b}_1 - 1,\ \hat{b}_2, \dots, \hat{b}_k)^T,
\]
\[
T_{\mathrm{CRPS}} = (\hat{c},\ \hat{d}_1 - 1/\sqrt{\pi},\ \hat{d}_2, \dots, \hat{d}_k)\, \hat\Sigma_{\mathrm{CRPS}}^{-1}\, (\hat{c},\ \hat{d}_1 - 1/\sqrt{\pi},\ \hat{d}_2, \dots, \hat{d}_k)^T,
\]

where $\hat\Sigma_{\mathrm{DSS}}$, $\hat\Sigma_{\mathrm{CRPS}}$ are the estimated covariance matrices. Both test statistics, $T_{\mathrm{DSS}}$ and
$T_{\mathrm{CRPS}}$, are asymptotically $\chi^2$-distributed with $1 + k$ degrees of freedom and asymptotic
p-values are given by

\[
\pi_{\mathrm{DSS}} = 1 - \chi^2_{1+k}(T_{\mathrm{DSS}}) \quad \text{and} \quad \pi_{\mathrm{CRPS}} = 1 - \chi^2_{1+k}(T_{\mathrm{CRPS}}).
\]

If $k = 1$, that is, in the case of just one forecaster, we obtain the test for an ideal forecaster
suggested in Held et al. (2010). We call the tests presented in this section SRA tests as they
are based on a score regression approach (SRA). As noted already by Held et al. (2010),
SRA tests can only be used if each forecaster predicts at least two different variances;
therefore, we cannot apply them to Example 3.1 by Gneiting and Ranjan (2013). Instead, we
consider the following setup for illustration.


Example 5.6. We consider two forecasters $F_{1,n} = \mathcal{N}(0, (1 + \sigma_n)^2)$ and $F_{2,n} = \mathcal{N}(0, (1 +
\sigma_n + \varepsilon_n)^2)$, where $\sigma_n \sim \mathcal{U}([0,1])$ and $\varepsilon_n \sim \mathcal{N}(0, 1/16)$. The observations are $Y_n \sim \mathcal{N}(0, (1 +
\sigma_n)^2)$. It is clear that $F_1$ is ideal with respect to $\mathcal{A}_1 = \sigma(\sigma_n)$, $F_2$ is not ideal with respect
to $\mathcal{A}_2 = \sigma(\sigma_n, \varepsilon_n)$, $F_1$ is cross-ideal with respect to $F_1, F_2$, and $F_2$ is not cross-ideal with
respect to $F_1, F_2$. In Table 7 the Monte-Carlo powers of the CRPS tests are displayed.
The tests are performed at significance level $\alpha = 0.05$. The results are in accordance
with the theoretical considerations. However, the Monte-Carlo power of the test if $F_2$ is
cross-ideal with respect to $F_1, F_2$ is higher than in the test if $F_2$ is ideal with respect to
$F_2$. It is interesting to see that taking $F_1$ into account helps to detect that $F_2$ is not ideal
with respect to $F_2$.
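The DSS version of the SRA test can be sketched for the case $k = 1$, where the regression has an intercept and one slope and the $\chi^2$ survival function with two degrees of freedom is simply $\exp(-T/2)$. The code below is an illustration under our own assumptions and does not reproduce the simulation study: function names are hypothetical, the covariance matrix is estimated by the usual homoscedastic formula $\hat\sigma^2 (X^T X)^{-1}$, and the parameter `inflation` is an artificial device that multiplies the predictive standard deviation of $F_1$ from Example 5.6 in order to create a forecaster that is not ideal (`inflation = 1` gives $F_1$ itself).

```python
import math
import random


def dss_regression_test(sigmas, ys):
    """DSS score regression test for one forecaster with predictive mean zero.

    sigmas: predictive standard deviations; ys: observations.
    Returns (a_hat, b_hat, T_DSS, p_value) for H_0: a = 1/2, b_1 = 1.
    """
    n = len(ys)
    x = [math.log(s) for s in sigmas]
    score = [0.5 * (math.log(s * s) + (y / s) ** 2) for s, y in zip(sigmas, ys)]
    x_bar = sum(x) / n
    s_bar = sum(score) / n
    sxx = sum((v - x_bar) ** 2 for v in x)
    b_hat = sum((v - x_bar) * (w - s_bar) for v, w in zip(x, score)) / sxx
    a_hat = s_bar - b_hat * x_bar
    rss = sum((w - a_hat - b_hat * v) ** 2 for v, w in zip(x, score))
    sigma2 = rss / (n - 2)
    # Wald statistic (theta_hat - theta_0)^T (X^T X) (theta_hat - theta_0) / sigma2
    da, db = a_hat - 0.5, b_hat - 1.0
    sx, sx2 = sum(x), sum(v * v for v in x)
    t = (n * da * da + 2.0 * sx * da * db + sx2 * db * db) / sigma2
    return a_hat, b_hat, t, math.exp(-0.5 * t)   # chi-square survival function, 2 df


def simulate(n, inflation, seed):
    """Forecast standard deviations and outcomes in the spirit of Example 5.6."""
    rng = random.Random(seed)
    sigmas, ys = [], []
    for _ in range(n):
        true_sd = 1.0 + rng.random()             # 1 + sigma_n with sigma_n ~ U([0, 1])
        sigmas.append(inflation * true_sd)       # hypothetical miscalibration factor
        ys.append(rng.gauss(0.0, true_sd))
    return sigmas, ys


print(dss_regression_test(*simulate(20000, 1.0, seed=11)))
print(dss_regression_test(*simulate(20000, 1.5, seed=11)))
```

For the ideal forecaster the estimates should be close to $a = 1/2$ and $b_1 = 1$, whereas the inflated forecaster has expected DSS $\log(\sigma) + 1/(2 \cdot 1.5^2)$, so the intercept is shifted and the test should reject.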


N                       30       50       100      200      500
$F_1$ wrt $F_1$         0.083    0.073    0.061    0.056    0.050
$F_1$ wrt $F_1, F_2$    0.094    0.073    0.064    0.059    0.055
$F_2$ wrt $F_2$         0.240    0.285    0.429    0.707    0.962
$F_2$ wrt $F_1, F_2$    0.292    0.376    0.582    0.852    0.998

Table 7: Monte-Carlo powers for the CRPS tests with sample sizes $N$ and 10'000 simulations
described in detail in Example 5.6.


Example 5.7 (Example 2.5 continued). Considering again Example 2.5, we used the
DSS test to assess if the forecasters are cross-ideal. In Table 8 we present the Monte-Carlo
powers of the tests, which are performed at significance level $\alpha = 0.05$. The CRPS
cannot be used for $F_2$ since the forecast is not normal. As expected, the tests show that
$F_1$ is ideal with respect to $\mathcal{A}_1$ and also cross-ideal with respect to $\mathcal{A}_1, \mathcal{A}_2$. For a sample
size of $N = 200$ the level of the test is kept reasonably well. The forecaster $F_2$ is ideal
with respect to $\mathcal{A}_2$ but fails to be cross-ideal with respect to $\mathcal{A}_1, \mathcal{A}_2$. The test shows a
good power already for a sample size of $N = 100$. However, for $F_2$ the test is slightly
anticonservative even for a sample size of $N = 500$.

N                       30       50       100      200      500
$F_1$ wrt $F_1$         0.085    0.075    0.064    0.055    0.056
$F_1$ wrt $F_1, F_2$    0.084    0.072    0.065    0.057    0.054
$F_2$ wrt $F_2$         0.111    0.096    0.090    0.074    0.072
$F_2$ wrt $F_1, F_2$    0.286    0.416    0.718    0.963    1.000

Table 8: Monte-Carlo power for the DSS tests with sample sizes $N$ and 10'000 simulations
described in detail in Example 5.7.


5.4 Summary 


We have presented three different approaches for testing cross-calibration: the CEP tests in Section 5.1, the LRA tests in Section 5.2 and the SRA tests in Section 5.3. The first two approaches allow testing for cross-calibration of $F_1$ with respect to any subset $J$ of the set of forecaster indices, whereas the SRA tests only allow testing for $F_1$ being cross-ideal, which is equivalent to requiring that $1 \in J$. The CEP test and the LRA test with respect to $J = \emptyset$ are tests for probabilistic calibration, that is, the classical hypothesis of uniformity and independence of PIT values. While the SRA tests require independent forecast-observation tuples, the CEP and the LRA tests are formulated in a prediction space for serial dependence, which is a scenario that is frequently encountered in practice; see also Section 6.

The CEP test has the advantage that it provides information concerning the parts of the distribution where miscalibration is detected (in terms of quantile levels); this is illustrated in Figures 5 and 6. It may be considered a disadvantage that the adjusted p-values are simulation based and depend on a grid $0 < z_1 < \dots < z_M < 1$ that has to be chosen. In simulations, the method has been shown to be robust to the number $M$ of grid points. In contrast, the p-values for the LRA test are given explicitly. The forecasters have to be described through a finite-dimensional parameter vector and there are some restrictions concerning the predictive parameters, as it has to be ensured that the design matrix $D_J$ at (9) has full rank. For the forecasters of Example 3.1, the LRA test has overall better power than the CEP test; see Examples 5.1 and 5.3. The difference is minor, except for the hypothesis that the forecaster $F_3$ is ideal. Here, for sample size $N = 50$, the LRA test achieves a power of 0.734, whereas the CEP test only has a power of 0.168. For the forecasters in Example 2.5, the CEP test outperformed the LRA test; see Examples 5.2 and 5.4. In fact, for sample size $N = 200$, the power of the CEP test is more than three times higher than the power of the LRA test.

The following modifications of the CEP and the LRA tests are straightforward but unexplored. The logistic regression model in (5) can be replaced by any other regression model for a binary outcome variable for which it is possible to formulate a test for an analogous pointwise null hypothesis. If forecasters choose their distributions from a parametric class of distributions, as assumed in the LRA approach, one could also regress the binary random variables $B_z$, which indicate whether the PIT value lies below $z$, on the predicted parameter values. In the LRA, the linear regression model stipulated at (8) can be replaced by some other regression model for a vector of real-valued outcomes.

We would like to remark that the CEP and the LRA tests are formulated in the prediction space setting for serial dependence and make use of condition (1). Deciding whether this assumption is justified in a given application context is sometimes a delicate matter. For example, if a forecaster $i$ bases her predictions purely on intuition, then (1) is certainly justified. If a forecaster $j$ uses a time series model for predictions, that is, predictions are exclusively derived from past data, then one may argue that assumption (1) fails and the CEP and LRA tests should only be applied with respect to sets $J$ such that $j \notin J$. It may be that some parameters of a predictive distribution are derived from past data, whereas others come from external sources such as expert opinion. Here, it could be argued that one should only regress on the latter type of parameters in the LRA tests and use a regression model in terms of these parameters for the CEP tests. A different point of view would be that the parameters based on past data are derived through a subjectively chosen model, so that after the fitting procedure they should be viewed as the personal opinion of the forecaster rather than as information influencing the outcome. We will discuss condition (1) further in Section 6.

The SRA tests, based on the score regression approach, require independent forecast-observation tuples, or, more precisely, independent sequences of realized score values $\mathrm{CRPS}(F_{1,n}, Y_n)$ or $\mathrm{DSS}(F_{1,n}, Y_n)$, $1 \le n \le N$, which may be a weaker requirement. They are asymptotic tests that appear to work well for sample sizes of at least $N = 100$; see Tables 7 and 8. The SRA test with the CRPS works only for predictive distributions from one location-scale family, whereas the SRA test with the DSS requires only that the predictive distributions have finite fourth moments. In both cases, the predictive standard deviations have to differ for at least two observations. For the forecasters of Example 2.5, the SRA test with the DSS showed better power than the CEP test, so it is an interesting alternative despite the more restrictive assumptions; see Examples 5.2 and 5.7. In particular, for sample size $N = 200$ the SRA test had a power of 0.963 in detecting that $F_2$ is not cross-ideal with respect to $F_1, F_2$, whereas the CEP test had a power of 0.464.

In the case of independent forecast-observation tuples it is possible to derive a test for marginal cross-calibration by testing, for each $y \in \mathbb{R}$, that the quantity in its defining condition has mean zero. Simulations have shown that the resulting asymptotic test has several problems in applications. For completeness, we report these findings in Appendix B.


6 Data example 


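The Bank of England (BoE) predicts the inflation rate of every quarter by using a probabilistic forecast with a potentially asymmetric two-piece normal distribution with parameters $\mu \in \mathbb{R}$ and $\sigma_1, \sigma_2 > 0$ and density

$$
f(y) =
\begin{cases}
(\pi/2)^{-1/2} (\sigma_1 + \sigma_2)^{-1} \exp\left( - \dfrac{(y - \mu)^2}{2\sigma_1^2} \right) & \text{if } y \le \mu, \\[2ex]
(\pi/2)^{-1/2} (\sigma_1 + \sigma_2)^{-1} \exp\left( - \dfrac{(y - \mu)^2}{2\sigma_2^2} \right) & \text{if } y \ge \mu.
\end{cases}
\tag{12}
$$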
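As a quick sanity check of the two-piece normal density, the following Python sketch evaluates it on a fine grid and verifies numerically that it integrates to one. The function name `two_piece_normal_pdf` and the parameter values are hypothetical illustrations, not values used by the BoE.

```python
import numpy as np

def two_piece_normal_pdf(y, mu, sigma1, sigma2):
    """Two-piece normal density: scale sigma1 to the left of mu, sigma2 to the right."""
    y = np.asarray(y, dtype=float)
    scale = np.where(y <= mu, sigma1, sigma2)
    const = np.sqrt(2.0 / np.pi) / (sigma1 + sigma2)   # equals (pi/2)^(-1/2) / (sigma1 + sigma2)
    return const * np.exp(-((y - mu) ** 2) / (2.0 * scale**2))

# Illustrative parameters: right-skewed forecast centred at 2.
grid = np.linspace(-20.0, 20.0, 200001)
dens = two_piece_normal_pdf(grid, mu=2.0, sigma1=0.5, sigma2=1.5)

# Trapezoidal rule written out by hand.
step = grid[1] - grid[0]
total_mass = float(np.sum(0.5 * (dens[1:] + dens[:-1])) * step)
print(total_mass)
```

With $\sigma_1 = \sigma_2 = \sigma$ the density reduces to the ordinary normal density with mean $\mu$ and standard deviation $\sigma$, and in general the probability mass to the left of $\mu$ is $\sigma_1/(\sigma_1 + \sigma_2)$.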


The forecasts have been issued by the BoE’s Monetary Policy Committee since February 1996 for the first quarter of 1996 and are publicly available online. The first quarter is from March to May, the second quarter from June to August, and so forth. Furthermore, there are forecasts available which have been issued between February 1993 and May 1997. These were converted into density forecasts retrospectively. Until the first quarter of 2004, the forecasts were issued to predict RPIX inflation rates. Since the first quarter of 2004, inflation has been predicted and assessed in terms of percentage changes over twelve months of the CPI. The observed RPIX as well as the CPI inflation rates are available from the Office for National Statistics under codes CDKQ and D7G7, respectively. There is no simple transformation that converts an RPIX inflation rate into a CPI inflation rate and vice versa, so we have analysed the two data sets separately: RPIX inflation rate predictions from the first quarter of 1993 to the first quarter of 2004 and CPI inflation rate predictions from the first quarter of 2004 to the first quarter of 2015. In both cases we have 45 forecast-observation tuples. For further detail on the data set, see Gneiting and Ranjan (2011, Section 4.1). The BoE inflation forecasts have also been previously analysed, for example, by Wallis (2003); Clements (2004); Mitchell and Hall (2005); Galbraith and van Norden (2012).


For both data sets, we compare the BoE predictions with a Gaussian autoregression (AR) of order one with a rolling estimation window of length six quarters, which leads to Gaussian density forecasts. The prediction horizon we consider is one quarter. As discussed in Section 5.4, the CEP and LRA tests make use of condition (1). While we believe that the BoE forecasts can be assumed to satisfy (1), this is more debatable in the case of the AR forecasts. If one is not willing to believe that (1) holds in this case, one should only consider the CEP and the LRA tests with respect to the empty set, that is, probabilistic calibration, and cross-calibration with respect to BoE. The conclusions we can draw about the quality of the forecasts remain essentially the same. Due to the serial dependence in the data, we do not apply the SRA tests.
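A rolling-window Gaussian AR(1) benchmark of this kind could be sketched as follows. The text does not state how the autoregression is estimated, so this sketch assumes ordinary least squares with an intercept on the last six observations and uses the residual standard deviation as the predictive standard deviation; the function name `rolling_ar1_forecasts` is hypothetical.

```python
import numpy as np

def rolling_ar1_forecasts(series, window=6):
    """One-step-ahead Gaussian forecasts (mean, sd) from an AR(1) fitted by OLS
    on the most recent `window` observations. Estimation details are assumptions."""
    y = np.asarray(series, dtype=float)
    means, sds = [], []
    for t in range(window, len(y)):
        past = y[t - window:t]                  # the last `window` observations
        x, target = past[:-1], past[1:]         # regress y_s on y_{s-1}
        design = np.column_stack([np.ones_like(x), x])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        dof = max(len(target) - 2, 1)           # two estimated coefficients
        means.append(coef[0] + coef[1] * past[-1])
        sds.append(np.sqrt(resid @ resid / dof))
    return np.array(means), np.array(sds)

# Toy usage on a noise-free AR(1) path, where the forecast mean is exact.
toy = [10.0]
for _ in range(11):
    toy.append(1.0 + 0.5 * toy[-1])
toy_means, toy_sds = rolling_ar1_forecasts(toy, window=6)
print(toy_means)
```

With a window of six quarters only five regression pairs enter each fit, so the estimated predictive standard deviation is very noisy; this is one reason such a benchmark can be poorly calibrated.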

First, we consider the CEP tests. The results for the BoE density forecasts can be seen in Figure 5 and those for the AR forecasts in Figure 6. In both plots the grid is $z_m = \{1 + (148/149)m\}/150$ for $0 \le m \le 149$, and 20'000 bootstrap replications are used to calculate the adjusted p-values under the null hypothesis. For the RPIX inflation rate forecasts in the top panel of Figure 5, the BoE forecast seems to be probabilistically calibrated and also cross-calibrated with respect to the AR forecast. It fails to be ideal, that is, cross-calibrated with respect to itself. As a theoretical consequence, it also fails to be cross-calibrated with respect to both the AR forecast and itself, by Theorem 2.9. The CEP test picks this up correctly, and rejects the null hypothesis with respect to BoE or with respect to BoE and AR for some small exceedance probabilities between zero and 0.05. However, it should be remarked that the rejection region is small, allowing the tentative conclusion that the BoE forecast is not far from being ideal or cross-ideal. For the CPI inflation rate predictions in the bottom panel of Figure 5, the situation is different. Probabilistic calibration of the BoE forecast is rejected for exceedance probabilities between 0.13 and 0.26. Note that this result makes no use of assumption (1). Cross-calibration with respect to AR and with respect to BoE itself is also rejected in some parts of the region between 0.13 and 0.26. In this case, the CEP test is not able to pick up a failure of cross-calibration with respect to both AR and BoE, although this is a theoretical consequence of the lack of probabilistic calibration by Theorem 2.9.

Figure 5: The p-values of the CEP tests for the BoE forecast. The top panel corresponds to the prediction of RPIX inflation rates, whereas the bottom panel displays the results for CPI inflation rates. The solid horizontal lines give the 0.05 level; the solid lines refer to probabilistic calibration (cross-calibration with respect to the empty set); the dashed lines refer to cross-calibration with respect to AR, the dash-dotted lines with respect to BoE and the dotted lines with respect to BoE and AR.
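The grid used for the CEP plots in this section can be written out explicitly. The short Python sketch below only evaluates the grid formula quoted in the text; the variable names are illustrative.

```python
import numpy as np

m = np.arange(150)                        # m = 0, 1, ..., 149
z = (1.0 + (148.0 / 149.0) * m) / 150.0   # z_m = {1 + (148/149) m} / 150

print(z[0], z[-1])                        # endpoints 1/150 and 149/150
```

The grid consists of 150 equally spaced points running from 1/150 to 149/150, so the endpoints 0 and 1 themselves are excluded.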

According to the CEP test, the AR forecast for the RPIX inflation rate is not probabilistically calibrated and therefore also not cross-calibrated, ideal or cross-ideal; see the top panel of Figure 6. For all tests, the forecaster fails in the region of exceedance probabilities lower than 0.4 and near 1. Cross-calibration with respect to BoE and AR is rejected for all exceedance probabilities with a very low p-value. While the overall conclusions remain the same for the CPI inflation rate forecasts, the situation is somewhat different; see the bottom panel of Figure 6. Cross-calibration with respect to BoE and with respect to AR and BoE is rejected for almost all exceedance probabilities. However, probabilistic calibration of the AR forecaster and cross-calibration with respect to itself are only rejected for some probabilities below 0.10 and above 0.80, indicating that the AR forecast might be superior to the BoE forecast for exceedance probabilities between 0.13 and 0.26.

| RPIX | BoE wrt $\emptyset$ | BoE wrt AR | BoE wrt BoE | BoE wrt BoE, AR |
|---|---|---|---|---|
| F-test | 0.338 | 0.185 | 0.010 | 0.010 |
| AD-test | 0.496 | 0.589 | 0.822 | 0.731 |
| adjusted | 0.676 | 0.370 | 0.021 | 0.020 |

| CPI | BoE wrt $\emptyset$ | BoE wrt AR | BoE wrt BoE | BoE wrt BoE, AR |
|---|---|---|---|---|
| F-test | 0.3973 | 0.5629 | 0.1486 | 0.2228 |
| AD-test | 0.0102 | 0.0122 | 0.0073 | 0.0047 |
| adjusted | 0.0203 | 0.0245 | 0.0146 | 0.0093 |

Table 9: The p-values for the LRA tests for the BoE forecast.

Figure 6: The p-values of the CEP tests for the AR forecast. The top panel corresponds to the prediction of RPIX inflation rates, whereas the bottom panel displays the results for CPI inflation rates. The solid horizontal lines give the 0.05 level; the solid lines refer to probabilistic calibration (cross-calibration with respect to the empty set); the dashed lines refer to cross-calibration with respect to BoE, the dash-dotted lines with respect to AR and the dotted lines with respect to BoE and AR.

| RPIX | AR wrt $\emptyset$ | AR wrt AR | AR wrt BoE | AR wrt BoE, AR |
|---|---|---|---|---|
| F-test | 0.193 | 0.136 | 0.325 | <0.001 |
| AD-test | 0.003 | 0.027 | 0.025 | 0.570 |
| adjusted | 0.006 | 0.054 | 0.049 | <0.001 |

| CPI | AR wrt $\emptyset$ | AR wrt AR | AR wrt BoE | AR wrt BoE, AR |
|---|---|---|---|---|
| F-test | 0.5109 | 0.1668 | <0.0001 | <0.0001 |
| AD-test | 0.0001 | 0.0004 | 0.1784 | 0.5436 |
| adjusted | 0.0002 | 0.0008 | <0.0001 | <0.0001 |

Table 10: The p-values for the LRA tests for the AR forecast.

Secondly, we consider the LRA tests. The parametric class $\mathcal{F}$ used for the tests is the class of two-piece normal distributions with parameters $\mu \in \mathbb{R}$, $\sigma_1 > 0$, $\sigma_2 > 0$ given at (12). We can perform all the tests as for the CEP. The corresponding p-values can be found in Tables 9 and 10. We also see whether the estimated regression parameter failed to be zero or the standard normality assumption for the residuals was violated. The results coincide with the ones from the CEP, but we do not see in which region of exceedance probabilities the forecasters failed. On the other hand, in this application the LRA tests are consistent with Theorem 2.9 in the sense that rejection of cross-calibration with respect to a smaller set implies rejection with respect to any superset.
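The "adjusted" rows of Tables 9 and 10 are numerically consistent, up to rounding, with doubling the smaller of the F-test and AD-test p-values, that is, with a Bonferroni correction for two tests. This reading is inferred from the reported numbers and is not stated in this section, so the following Python sketch is only an illustration under that assumption; the function name `bonferroni_two_tests` is hypothetical.

```python
def bonferroni_two_tests(p_f, p_ad):
    """Bonferroni-adjusted p-value for two tests: twice the smaller p-value, capped at 1.
    Assumed reading of the 'adjusted' rows, not taken from the text."""
    return min(1.0, 2.0 * min(p_f, p_ad))

# RPIX, BoE with respect to the empty set (Table 9): F-test 0.338, AD-test 0.496.
adjusted_rpix = bonferroni_two_tests(0.338, 0.496)
# RPIX, AR with respect to AR (Table 10): F-test 0.136, AD-test 0.027.
adjusted_ar = bonferroni_two_tests(0.136, 0.027)
print(adjusted_rpix, adjusted_ar)
```

Under this reading, an adjusted p-value is small as soon as either the regression coefficients or the residual distribution contradict the null hypothesis.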


7 Discussion 


We have extended the prediction space setting of Gneiting and Ranjan (2013) to accommodate serially dependent forecasts, which are commonly encountered in practice. For prediction spaces with serial dependence, we have shown a refined version of the result of Diebold et al. (1998) on uniformity and independence of PIT values. It relies on condition (1), whose implications should be studied in greater detail. We have focussed on the case of one-period-ahead forecasts, as in the original result. As mentioned in Remark 2, an analogous result continues to hold for g-step ahead forecasts. However, additional complications arise in testing for cross-calibration, which need further investigation in future research.

We have refined the notions of calibration to notions of cross-calibration and we have provided powerful statistical tests for these properties requiring minimal assumptions on the sequences of forecasts and observations. The characterization of cross-calibration and cross-ideal forecasters in Proposition 2.10 sheds some light on the difference between ideal forecasters and probabilistically calibrated forecasters as discussed in Gneiting and Ranjan (2013). It is remarkable that, with our approaches, testing for ideal forecasters is not more difficult than testing for probabilistic calibration, contrary to the doubts voiced in Gneiting and Ranjan (2013).


In order to optimize forecasting performance, it is natural to combine forecasts. Gneiting and Ranjan (2013) have proposed combination formulas and aggregation methods to combine several forecasters; see also Ranjan and Gneiting (2010). It would be interesting to consider under which conditions calibrated forecasters can be combined to yield cross-calibrated forecasts. Also, the more refined notions of cross-calibration in this paper may help to identify which forecasters to include in combination formulas and which ones do not add information about the future outcome. Finally, combining forecasts is only a good idea if the predictions are based on different information sets. If there is a forecaster that is cross-calibrated with respect to all forecasters, any combination of forecasts would compromise on forecast quality.

Our approach may add another perspective on the concerns raised by Mitchell and Wallis (2011) concerning the principle to “Maximize sharpness subject to calibration” formulated by Murphy and Winkler (1987); Gneiting and Raftery (2007). In fact, the concept of cross-calibration allows one to assess the statistical compatibility of several forecasters with the observations. When considering calibration and sharpness as suggested by Gneiting and Raftery (2007), calibration concerns the interplay of one forecaster and the observation, whereas sharpness compares forecasters but makes no reference to observations. Possibly, the formulated guiding principle should be modified to “Maximize sharpness subject to cross-calibration”, which is a stronger requirement in terms of calibration and therefore gives somewhat less importance to sharpness.
A Calculations for Example 3.1


Let $\mu \sim \mathcal{N}(0, 1)$ and let $\tau$ take values 1 or $-1$ with equal probability, independent of $\mu$. Conditional on $\mu$ and $\tau$, the observation is $Y \sim \mathcal{N}(\mu, 1)$ and the forecasters have the following predictive distribution functions:

$$
\begin{aligned}
F_1(y) &= \Phi(y - \mu), & F_2(y) &= \Phi(y/\sqrt{2}), \\
F_3(y) &= \tfrac{1}{2}\Phi(y - \mu) + \tfrac{1}{2}\Phi(y - \mu - \tau), & F_4(y) &= \Phi(y + \mu)
\end{aligned}
$$

for $y \in \mathbb{R}$. As in Gneiting et al. (2007), we use the definitions $\Psi_+(x) = \tfrac{1}{2}\{\Phi(x) + \Phi(x - 1)\}$ and $\Psi_-(x) = \tfrac{1}{2}\{\Phi(x) + \Phi(x + 1)\}$. Thus, $\Psi_-(x) = \Psi_+(x + 1)$ and $\Psi_-^{-1}\{\Psi_+(x + 1)\} = x$.


Proposition A.1. The unfocused forecaster is probabilistically cross-calibrated with respect to $F_1, F_2, F_4$.


Proof. Let $y \in (0,1)$. We have

$$
\begin{aligned}
Q(Z_{F_3} < y \mid F_1, F_2, F_4) &= Q(Z_{F_3} < y \mid \mu) \\
&= \tfrac{1}{2} Q(\Psi_+(Y - \mu) < y \mid \mu) + \tfrac{1}{2} Q(\Psi_-(Y - \mu) < y \mid \mu) \\
&= \tfrac{1}{2} \Phi\{\Psi_+^{-1}(y)\} + \tfrac{1}{2} \Phi\{\Psi_-^{-1}(y)\} = y.
\end{aligned}
$$

□
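The final identity in the proof can be checked numerically. The following Python sketch inverts $\Psi_+$ and $\Psi_-$ by bisection and evaluates $\tfrac{1}{2}\Phi\{\Psi_+^{-1}(y)\} + \tfrac{1}{2}\Phi\{\Psi_-^{-1}(y)\}$ at a few levels $y$; the helper names and the chosen levels are illustrative assumptions.

```python
import math

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def psi_plus(x):
    return 0.5 * (Phi(x) + Phi(x - 1.0))

def psi_minus(x):
    return 0.5 * (Phi(x) + Phi(x + 1.0))

def invert(func, y, lo=-12.0, hi=12.0, iters=200):
    """Bisection inverse of a continuous increasing function on [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if func(mid) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def conditional_pit_cdf(y):
    """Value of (1/2) Phi(Psi_+^{-1}(y)) + (1/2) Phi(Psi_-^{-1}(y))."""
    return 0.5 * Phi(invert(psi_plus, y)) + 0.5 * Phi(invert(psi_minus, y))

levels = [0.05, 0.25, 0.5, 0.75, 0.95]
max_error = max(abs(conditional_pit_cdf(y) - y) for y in levels)
print(max_error)
```

The check works because $\Psi_-^{-1}(y) = \Psi_+^{-1}(y) - 1$, so the sum equals $\Psi_+\{\Psi_+^{-1}(y)\} = y$.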


B Testing for marginal cross-calibration 


We consider two forecasters $F_1$ and $F_2$ within the prediction space setting. Our interest lies in

$$
S(y) = F_2(y) - \mathbb{1}\{F_2^{-1}(Z_{F_1}) \le y\}, \quad y \in \mathrm{supp}(Y),
$$

where $\mathrm{supp}(Y)$ denotes the support of the observation $Y$. We would like to test if $\mathbb{E}_Q S(y) = 0$ for all $y \in \mathrm{supp}(Y)$, because this is equivalent to marginal cross-calibration of $F_1$ with respect to $F_2$; cf. Definition 2.7.

We suppose that we have $N$ independent and identically distributed observations $(F_{1,n}, F_{2,n}, Y_n, V_n)$ for $1 \le n \le N$ in a prediction space and define for each $n$,

$$
S_n(y) = F_{2,n}(y) - \mathbb{1}\{F_{2,n}^{-1}(Z_{1,n}) \le y\}, \quad y \in \mathrm{supp}(Y).
$$

We pick a sequence $y_0 < y_1 < \dots < y_m$ in the support of $Y$ and define

$$
\mathbf{S}_n = (S_n(y_0), S_n(y_1), \dots, S_n(y_m))^\top,
$$

and $\bar{\mathbf{S}}_N = (1/N) \sum_{n=1}^{N} \mathbf{S}_n$. Let $\Sigma_N = (1/N) \sum_{n=1}^{N} (\mathbf{S}_n - \bar{\mathbf{S}}_N)(\mathbf{S}_n - \bar{\mathbf{S}}_N)^\top$ be the sample covariance matrix. If $F_1$ is marginally cross-calibrated with respect to $F_2$, then $\mathbb{E}(\mathbf{S}_n) = 0$.
Figure 7: The p-values of the marginal cross-calibration test for five different simulated data sets and an increasing number of grid points. The horizontal line marks the $\alpha = 0.05$ significance level.


Therefore, by standard arguments of probability theory, the test statistic $T = N \bar{\mathbf{S}}_N^\top \Sigma_N^{-1} \bar{\mathbf{S}}_N$ converges in distribution to $\chi^2_m$, a chi-squared distribution with $m$ degrees of freedom.
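The computation of the test statistic can be sketched in Python as follows. The sketch uses the fact that, for a distribution function $F$ with generalized inverse $F^{-1}$, the event $F^{-1}(z) \le y$ coincides with $z \le F(y)$, so only the values $F_{2,n}(y_k)$ and the PIT values of $F_1$ are needed. The function name, the toy simulation (both forecasters equal to $\mathcal{N}(\mu_n, 1)$) and the seed are assumptions for illustration; the p-value would then be obtained from the chi-squared limit stated above.

```python
import math
import numpy as np

def marginal_cross_calibration_statistic(f2_at_grid, pit1):
    """T = N * S_bar' Sigma^{-1} S_bar.
    f2_at_grid: (N, m) array of F_{2,n}(y_k); pit1: length-N PIT values of F_1."""
    f2 = np.asarray(f2_at_grid, dtype=float)
    z = np.asarray(pit1, dtype=float)[:, None]
    s = f2 - (z <= f2).astype(float)     # S_n(y_k), since F^{-1}(z) <= y iff z <= F(y)
    n = s.shape[0]
    s_bar = s.mean(axis=0)
    centred = s - s_bar
    cov = centred.T @ centred / n        # sample covariance matrix with factor 1/N
    return float(n * s_bar @ np.linalg.solve(cov, s_bar))

def normal_cdf(x):
    erf = np.vectorize(math.erf)
    return 0.5 * (1.0 + erf(np.asarray(x, dtype=float) / math.sqrt(2.0)))

# Toy data: mu_n ~ N(0, 1), Y_n ~ N(mu_n, 1), and F_1 = F_2 = N(mu_n, 1).
rng = np.random.default_rng(1)
mu = rng.normal(size=500)
y_obs = rng.normal(loc=mu)
grid = np.array([-0.95, 0.0, 0.95])
pit1 = normal_cdf(y_obs - mu)                 # Z_{1,n} = F_{1,n}(Y_n)
f2_at_grid = normal_cdf(grid[None, :] - mu[:, None])
T = marginal_cross_calibration_statistic(f2_at_grid, pit1)
print(T)
```

If the sample covariance matrix is singular, as discussed for small samples or poorly chosen grid points, `np.linalg.solve` raises an error, which mirrors the practical problem described in this appendix.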

We test the null hypothesis that $\mathbb{E}_Q S(y) = 0$ for all $y \in \mathrm{supp}(Y)$ only through one particular finite-dimensional distribution of $S(y)$. Therefore, the sequence $y_1, \dots, y_m$ has to be chosen carefully. Simulations indicate that the level and power of the test are not much affected by the choice of $y_1, \dots, y_m$. However, for small sample sizes the number of grid points $m$ should be rather small; otherwise the sample covariance matrix may be singular and cannot be inverted to compute the test statistic. Another reason for singularity of the sample covariance matrix for small sample sizes may be the choice of a grid point $y$ such that the probability that $F_2^{-1}(Z_{F_1}) \le y$ is small. Unfortunately, for an individual test case, different choices of $y_1, \dots, y_m$ may lead to completely different p-values, which makes the test useless in practice. We illustrate these effects in the following example.


Example B.1. We consider the forecasters $F_1, \dots, F_4$ and the observation $Y$ from Example 3.1. Let $N = 500$ be the number of observations from $(F_i, F_j, Y)$, for each pair $F_i$ and $F_j$ with $1 \le i, j \le 4$. The results in Table 11 show that the marginal cross-calibration test performs well overall, and the performance is relatively unaffected by the choice of different grid points $y_1, y_2, \dots, y_m$. However, if we consider an increasing number of grid points for the same data set, the p-value changes substantially. This is illustrated in Figure 7 for five different simulated data sets with $N = 500$ and the null hypothesis that $F_3$ is marginally cross-calibrated with respect to $F_4$.
| m = 9 | $F_1$ | $F_2$ | $F_3$ | $F_4$ |
|---|---|---|---|---|
| $F_1$ | 0.066 | 0.0694 | 0.0655 | 0.0622 |
| $F_2$ | 1 | 0.0671 | 1 | 1 |
| $F_3$ | 0.0689 | 0.0708 | 0.5782 | 0.0597 |
| $F_4$ | 1 | 1 | 1 | 0.0668 |

| m = 4 | $F_1$ | $F_2$ | $F_3$ | $F_4$ |
|---|---|---|---|---|
| $F_1$ | 0.0556 | 0.0545 | 0.0511 | 0.0578 |
| $F_2$ | 1 | 0.0555 | 0.9972 | 1 |
| $F_3$ | 0.0531 | 0.0534 | 0.5122 | 0.0563 |
| $F_4$ | 1 | 1 | 1 | 0.0554 |

| m = 3 | $F_1$ | $F_2$ | $F_3$ | $F_4$ |
|---|---|---|---|---|
| $F_1$ | 0.0526 | 0.0568 | 0.0519 | 0.0545 |
| $F_2$ | 0.9993 | 0.0524 | 0.986 | 1 |
| $F_3$ | 0.0532 | 0.0566 | 0.431 | 0.0586 |
| $F_4$ | 1 | 1 | 1 | 0.0521 |

Table 11: Monte-Carlo powers of the marginal cross-calibration tests for sample size $N = 500$ and grid points $(y_1, y_2, \dots, y_9) = (-1.81, -1.19, -0.74, -0.36, 0.00, 0.36, 0.74, 1.19, 1.81)$, $(y_1, y_2, y_3, y_4) = (-1.19, -0.35, 0.35, 1.19)$ and $(y_1, y_2, y_3) = (-0.95, 0, 0.95)$, respectively, for the first, second and third table. The value in the $i$-th row and $j$-th column is the proportion of rejections of the null hypothesis that $F_i$ is marginally cross-calibrated with respect to $F_j$ at level $\alpha = 0.05$ for the forecasters $F_1$, $F_2$, $F_3$ and $F_4$ in Example B.1 in 10'000 simulations.